Practical Implementation Traits of Service Choreography Based Microservices Integration


This is the second post in a three part series looking at the topic of microservice integration. In the first instalment, I focused mainly on the theory side of event-driven service choreography. In this second part, I’ll dig into the practical traits that we’ll require of our technical implementations that enable us to satisfy the theory discussed in the first post. In the final instalment of the series, I’ll look at different specific implementation techniques/technologies and how they map to the traits discussed in this post.

Implementation traits

I’d like to provide some coverage on what I believe to be the key traits we look for in service choreography based microservices integration implementation techniques/technologies. My goal is to set the scene to make it easier to test specific technologies against those traits (subject of the third and final post in this series).

I prefer to break the traits down into two categories: must-haves and nice-to-haves. The must-haves category contains traits that I believe are absolutely necessary in order to successfully apply the theory of service choreography. The nice-to-haves category contains traits that you can essentially live without, but that can definitely buy you additional benefits. Like with most things, though, the decision to adopt nice-to-haves will be driven by context – we’re always making trade-offs, and you simply have to make judgement calls on a case-by-case basis.

Let’s move on to the first category of traits, the must-haves!

Must-have traits

Decoupled in time

This one is pretty straightforward. In the first instalment of this series, I discussed the asynchronous nature of service choreography based integration. Whatever implementation direction we go in, it needs to support us decoupling services in time. This means, for example, service A does not require service B to be online at a specific point in time (now) – we just need to ensure we have some mechanism in place for events from service B to eventually reach service A at some point in the future.

Guaranteed at-least-once-delivery

An absolute prerequisite for ensuring eventual consistency is that we guarantee events eventually reach their interested consumers. But why don’t we aim for exactly-once-delivery instead? I’m not going to repeat what many others have said before me, so suffice it to say it’s simply not possible to achieve in a distributed system. Google is your friend if you want to explore why :-)

So, we’re happy to settle for at-least-once-delivery because sending duplicates of a specific event is better than sending no event at all (that’s what you might see with at-most-once-delivery). The ability to guarantee at-least-once-delivery also implies the need for durability.

The biggest gotcha I see when it comes to at-least-once-delivery is what I’ve generally seen referred to as the dual-write problem. Whether you’re using a traditional CRUD approach, or you’re using event sourcing, you are going to end up pretty unhappy if you have a unit of code that both writes to a datastore and delivers events to, say, a message queue. Let’s examine two ways I’ve seen this done:

Write to MQ after committing a database transaction
doInDatabaseTransaction { statement =>
  statement.insert("INSERT into ....")
}
messageQueue.publish(new SomeEvent(...))

Okay, so we make changes to the book of record (the database), the database transaction gets committed, and only once that happens, do we publish an event to the message queue where our interested consumers will be listening. This would work perfectly well in a world where nothing bad ever happens. But there are all sorts of things that can go wrong here, including the most obvious:

  1. Our application crashes immediately after the database transaction commits
  2. Our application is restarted immediately after the database transaction commits
  3. The message queue infrastructure is down for a few minutes, meaning we can’t send the event right now

In any of these cases, our book of record would be updated, but our downstream consumers would never receive the event. In one fell swoop, we’ve guaranteed that we’ll end up in an inconsistent state. You may say these circumstances are rare, and you’d be right, but Murphy’s Law – and our own experiences as software engineers – teach us that if it can go wrong, it will go wrong. Guaranteed.

Let’s try another approach…

Write to MQ within scope of a database transaction
doInDatabaseTransaction { statement =>
  statement.insert("INSERT into ....")
  messageQueue.publish(new SomeEvent(...))
}

Hang on, that transaction boundary is bound to the database only; it’s got nothing to do with the message queue technology. In this example, if our database transaction were to rollback for any reason, our event would still have been published to the message queue. Oh dear, the event we sent out to interested consumers is not consistent with our book of record. Our consumers will proceed to behave as if the event has taken place, but our local context (the source of the event) will have no knowledge of it ever having taken place. That’s bad. Very bad. Arguably even worse than the first example.

Let’s get something straight here and now – unless we start dabbling in distributed transaction managers (e.g. XA standard) and two-phase commits, we can’t atomically update a database and write to a message queue. Two-phase commit is a disease we want to quarantine ourselves from forever, so we need another way to escape from the dual-write problem. You’ll have to wait until the third part of this mini-series for a solution ;-)

Guaranteed message ordering

The need for a stream of events to be consumable in order really depends on the use cases of its consumers. Very broadly speaking, there are two categories of consumer use case:

  1. Consumers that consume events from another service where consumption results in state transitions in the local context (e.g. projecting state locally from an external bounded context). Such consumers are conceptually stateful in that they care about tracking state across a series of multiple related events over time. In such cases, it’s usually necessary to process the events in the order they were originally produced for the local state to remain consistent with the source. It’s important to emphasise that it is only related events for which the order is necessary (e.g. events emanating from a specific aggregate instance).

  2. Consumers that are conceptually stateless in that they can treat every event they encounter as if it’s completely unrelated to any other event they’ve encountered in the past. Such consumers will typically trigger some kind of one off action, such as sending an email, sending a push notification, or triggering an external API call. An example of this might be where the reaction to an event requires charging a credit card via a third-party payment gateway.

Given that service choreography will inherently lead to many instances of use case 1) in your services, it becomes inevitable that you make implementation choices that allow events to be consumed in the order they were produced. With this in mind, it makes sense to choose implementation techniques/technologies that provide this guarantee, even if some of your consumers don’t rely on ordering.
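To make that distinction concrete, here’s a minimal Python sketch (the event shape and `aggregate_id` field are hypothetical, purely for illustration) showing that an interleaved stream only needs its order preserved per aggregate instance:

```python
from collections import defaultdict

def partition_key(event):
    # Order only needs to hold per aggregate instance, so we key the
    # stream by the aggregate id carried on each event.
    return event["aggregate_id"]

def group_by_aggregate(events):
    """Split one interleaved stream into per-aggregate sub-streams.

    Sub-streams for different aggregates may be consumed independently,
    but within each sub-stream the original order is preserved.
    """
    streams = defaultdict(list)
    for event in events:
        streams[partition_key(event)].append(event)
    return dict(streams)

# An interleaved stream of events from two aggregate instances.
stream = [
    {"aggregate_id": "order-1", "type": "OrderPlaced"},
    {"aggregate_id": "order-2", "type": "OrderPlaced"},
    {"aggregate_id": "order-1", "type": "OrderShipped"},
]

streams = group_by_aggregate(stream)
# order-1's events keep their relative order: placed before shipped
```

This is, of course, exactly the property that partitioned stream technologies give you when you choose the aggregate id as the partition key.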

Guaranteed at-least-once-processing

Well, I guess what we really want is exactly-once-processing! However, I thought it would be helpful to write a separate subsection on idempotency (see below). I find it useful to separate the general action of processing from the outcome of the processing – even if we handle a message/event idempotently (e.g. through some method of deduplication), I still like to consider that the message/event has been processed, despite the absence of any side effects. I find it simpler to think of processing as meaning a consumer has handled a message/event and is now ready to handle the next one in the stream.

It’s really important to emphasise the word ‘eventual’ in eventual consistency. Whilst it seems obvious, I have seen people neglect the fact that eventual means something will definitely happen in the end. Yes, we acknowledge that consistency may be delayed, but we still rely on consistency being achieved in the end. When we’re going down the microservices path – and following the service choreography approach – we need, in many cases, cast iron guarantees that we’ll eventually process every event we’re interested in. For example, if we are projecting state locally (achieving autonomy and encapsulated persistence) based on events produced by another service (bounded context), and our local business logic relies on that state, we can have zero trust in the entire system if we can’t guarantee that we’ll successfully process every event we’re interested in.

A murky subtext here is how to deal with processing errors. Whatever the reason for an error during handling of an event, you are forced to consider the fact that, if you continue to process further events without processing the event raising the error, you could leave your system in a permanently inconsistent state. Where it’s absolutely necessary for a consumer to handle events in order, you really are forced to block all subsequent processing until you’ve found a way to successfully process the event that’s raising an error. There’s an obvious danger here that your target SLA on eventual consistency could be quickly blown out of the water if, for example, the solution to the failed processing involved code changes. As discussed above, ordering is rarely a requirement across every event in a stream. With this in mind, the ability to achieve some form of parallelism in event handling may well be necessary to avoid complete gridlock in a specific consumer. I’ll discuss this in the nice-to-haves section.

Where the requirement to process events in order can be relaxed, dealing with processing errors can be a little more straightforward. An option might be to log the event raising an error (after exhausting retries), and move on to subsequent events in the stream. You could put in place some mechanism to replay from the error log once necessary work has been carried out to ensure the event can be successfully processed.
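As a rough illustration of that relaxed-ordering approach, here’s a Python sketch – retry a failing event a bounded number of times, then park it in an error log and move on (the handler and event values are made up for illustration):

```python
def process_stream(events, handle, max_retries=3):
    """Consume events in order; park failures after exhausting retries.

    Only appropriate when strict ordering can be relaxed: a failing
    event is logged for later replay, and we move on to the next one.
    """
    error_log = []
    for event in events:
        for attempt in range(1, max_retries + 1):
            try:
                handle(event)
                break
            except Exception as exc:
                if attempt == max_retries:
                    # Retries exhausted: park the event for replay once
                    # the underlying problem has been fixed.
                    error_log.append((event, str(exc)))
    return error_log

def handler(event):
    if event == "bad":
        raise ValueError("cannot process")

# "bad" is parked after three failed attempts; the stream keeps moving.
parked = process_stream(["ok", "bad", "ok"], handler)
```

A real implementation would persist the error log durably so that the replay mechanism mentioned above can pick it up later.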

In some circumstances, it may even be ok to never process an event. For example, consider an email notification use case. Given that processing failure rates are likely to be pretty low in normal operation, you may deem it acceptable for the odd system email to never reach an intended customer.


Idempotency

Given the inability to achieve exactly-once-delivery, and our fallback to at-least-once-delivery, we can’t just ignore the fact that consumers will, on occasion, encounter the same event more than once. Idempotency is a property of an event handler that allows the same event to be applied multiple times without any new side effects beyond the initial application. In some cases, it might be ok to live with repeated side effects, and in some cases it won’t be ok. For example, we might not mind if we send a duplicate email, but a customer won’t be too happy if we charge their credit card twice for the same order.

Some actions are naturally idempotent, in which case you don’t need to explicitly worry about duplicate application, but there are many cases where it’s going to matter, and so you need to introduce mechanisms to avoid duplicate application. I’m going to resist exploring patterns for idempotent event handling in this series of posts, as it warrants dedicated coverage of its own. Mechanisms for implementing idempotency are typically application level concerns, rather than, for example, being something you can rely on some middleware layer to handle for you. Whatever implementation mechanisms you choose to integrate services via asynchronous events, you’ll need to deal with ensuring idempotency in the way you handle the events.
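As a simple illustration of one such mechanism – deduplication by event id – here’s a hedged Python sketch. In a real service, the processed-id set would need to be persisted transactionally alongside the state the handler mutates, otherwise you’re back to a dual-write problem:

```python
class IdempotentConsumer:
    """Deduplicate by event id so redeliveries cause no new side effects.

    An in-memory set stands in for what would be a durable store,
    updated in the same transaction as the handler's own state changes.
    """
    def __init__(self, handle):
        self.handle = handle
        self.processed_ids = set()

    def on_event(self, event):
        if event["id"] in self.processed_ids:
            return  # duplicate delivery: already applied, skip side effects
        self.handle(event)
        self.processed_ids.add(event["id"])

# Hypothetical usage: charging a card must happen exactly once.
charges = []
consumer = IdempotentConsumer(lambda e: charges.append(e["amount"]))
event = {"id": "evt-1", "amount": 100}
consumer.on_event(event)
consumer.on_event(event)  # redelivered duplicate is ignored
```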

On a side note, it’s worth mentioning that some third-party, external services you integrate with may give you some help in this area. For example, Stripe’s API supports passing an ‘idempotency key’ with a request, and it guarantees that, in a 24 hour window, it won’t reprocess two API calls that share the same key.

Nice-to-have traits

Consumer-side failure recovery

I was very close to including this trait within the must-haves group, but decided to be lenient. Now that we understand autonomy to be a key attribute for reactive microservices, it follows, in my opinion, that consumers must be responsible for recovering from their own failures without burdening upstream sources of events. I’ve worked with message oriented systems where a producer of events is relied upon to re-dispatch messages in the event a downstream consumer has got itself in a mess. It strikes me that such an approach is not compliant with the autonomy objective – if a consumer is dependent on a producer going beyond its operational responsibilities to help it recover from failure, the autonomy of that consumer is called into question.

This trait drives an alternative way of thinking from more traditional forms of middleware and/or integration patterns. In the third part of this series of posts, I’ll look at how distributed commit log technologies (such as Apache Kafka and Amazon Kinesis) have a considerable advantage over traditional MQ and pub/sub technologies in regard to this nice-to-have integration trait. It boils down to inversion of control, whereby the responsibility for tracking a consumer’s progress through a stream of events becomes the responsibility of the consumer rather than a central messaging broker.
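To illustrate that inversion of control, here’s a toy Python sketch in which an append-only list stands in for the commit log, and the consumer tracks its own offset. This is a simplification of what technologies like Kafka provide, not a real client API:

```python
class LogConsumer:
    """A consumer that owns its own position in an append-only log.

    The broker merely serves the log; the consumer persists its offset,
    so recovering from failure (or rebuilding from offset 0) never
    requires the producer to re-dispatch anything.
    """
    def __init__(self, log, handle, offset=0):
        self.log = log        # append-only list standing in for the commit log
        self.handle = handle
        self.offset = offset  # would be stored durably in a real system

    def poll(self):
        while self.offset < len(self.log):
            self.handle(self.log[self.offset])
            self.offset += 1  # commit progress only after successful handling

log = ["e1", "e2", "e3"]
seen = []
consumer = LogConsumer(log, seen.append)
consumer.poll()      # processes e1..e3
log.append("e4")
consumer.poll()      # resumes from its own recorded offset
```

Recovery is then entirely a consumer-side concern: restart from the last committed offset, or rewind to zero to rebuild state from scratch.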

Decoupled in space

In the must-haves section, I covered the trait of integration being decoupled in time. Going a stage further, you can aim for services to be decoupled in space as well. Anyone who has worked with a service-oriented architecture, especially where synchronous integration between services is the norm, will be familiar with the challenge of service addressability. Dealing with the overhead of managing configuration for many service endpoints can be quite a burden.

If we’re able to remove this overhead in some way, thus achieving significant location transparency, it can further simplify our service integration challenges. Using middleware technology is a great way of achieving this. Decoupling in space is also possible without middleware – contemporary service discovery/locator patterns do facilitate this to some extent – and I’ll weigh up the two approaches in the third and final post of this series.


Parallel consumers

In an ideal world, we’d want the ability to parallelise the processing capabilities of a specific consumer by starting multiple instances. A common pattern when using messaging middleware is to have a single queue with multiple consumers, each being sent messages in a round-robin fashion, with no consumer receiving the same message. This approach works fine in scenarios where processing messages in order is not important. However, as discussed in the must-haves section, we’ll often encounter the need for a consumer to process a stream of events strictly in order, especially when applying service choreography based integration. As also discussed earlier, it’s rarely the case that a consumer cares to receive every event in order; more likely, it’s important that events that are related in some way to each other are processed from a stream in the order they were generated (e.g. events emanating from a specific aggregate instance). With this in mind, it’s a nice-to-have to find a way to parallelise consumers whilst still ensuring events related to each other are processed in order. By doing this we get these primary benefits:

  1. We can improve the performance of our system through horizontal scaling, reducing the latency of eventual consistency.
  2. It’s easier to implement high availability of consumers rather than have single points of failure.
  3. We can avoid a consumer use case being completely blocked when encountering a repeated error in processing a single event. If we’re able to parallelise in some way, we can at least have that consumer use case continue processing some events (as long as they aren’t related to the stubborn one) rather than stopping processing altogether.
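One common way to achieve this – and roughly what partitioned commit logs do for you – is to route events to workers by a stable hash of the aggregate id. A toy Python sketch (event shapes hypothetical):

```python
import zlib

def route(event, num_workers):
    # A stable hash of the aggregate id sends all events for one
    # aggregate to the same worker, preserving their relative order,
    # whilst unrelated aggregates proceed in parallel.
    key = event["aggregate_id"].encode("utf-8")
    return zlib.crc32(key) % num_workers

def dispatch(events, num_workers):
    """Fan events out to per-worker queues, keeping per-key order."""
    queues = [[] for _ in range(num_workers)]
    for event in events:
        queues[route(event, num_workers)].append(event)
    return queues

events = [
    {"aggregate_id": "order-1", "seq": 1},
    {"aggregate_id": "order-2", "seq": 1},
    {"aggregate_id": "order-1", "seq": 2},
]
queues = dispatch(events, num_workers=4)
worker = route(events[0], 4)
# order-1's two events land on the same worker, in their original order
```

A stubborn event then blocks only the worker owning its key; every other worker keeps draining its own queue.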

In the third part of this series of posts, I’ll look at the technology options available to us that enable both guaranteed in-order processing and parallel consumers.

Wrapping up

Phew, that’s the end of a long post! I’ve covered both the must-have traits and the nice-to-have traits of microservices integration implementations that are supportive of service choreography. In the third and final post of this series, I’ll at last get round to looking at specific technologies and techniques that enable us to satisfy these traits. Stay tuned!

The Art of Microservices Integration Using Service Choreography


This is the first post in a three part series looking at the topic of microservice integration. In this first instalment, I’ll be focusing mainly on the theory side of event-driven service choreography. In the second post, I’ll cover the implementation traits required to satisfy the theory discussed in this first post, and, in the final post, I’ll be assessing the support available for those traits in well known implementation technologies. So, let’s get on with part one!

Looking back

One of the biggest shortcomings of traditional SOA is/was the tendency to break up a highly-coupled monolith into a series of smaller services with the same level of coupling that was previously internal to the monolith. The likely result is a distributed monolith with all the same problems you had before, but now with an additional operational burden – you end up in a worse position than if you’d just stuck with the monolith!

It’s this learning that I believe was the main catalyst for the development of the microservices architecture pattern (SOA 2.0?). In hindsight it seems pretty obvious: if you can’t run a service in isolation, with significant levels of autonomy, it’s pretty hard to justify why a piece of functionality is better in a separate service than simply internal to a monolith. It’s not that there aren’t some potentially good reasons why people would choose to do old style SOA; it’s just that those good reasons are outweighed by the negatives most of the time.

Looking forward

So, enter the world of microservices architecture, and the promise of isolation, autonomy, single-responsibility, encapsulated persistence etc. But, what exactly allows one to achieve such isolation and autonomy?

In this post, I’m going to focus on how getting service integration right is a fundamental part of what enables services to be isolated, and act autonomously. I believe having a thorough understanding of the principles of good service integration can train your mind into grasping the importance of good service separation. As a little bonus, I’ll close this post by looking at some considerations for good service separation.

Service orchestration vs service choreography

In a classic distributed monolith scenario as described above, the prevalent integration technique is likely to involve service orchestration. This is where backend services typically have a high-level of synchronous coupling – i.e. a service is reliant on other services to be operational and working, within a single request/response cycle, in order for it to carry out its own responsibilities. Such real-time dependencies prevent a service from acting autonomously – any failures in its dependencies will inevitably cascade, preventing the service from fulfilling its own responsibility. Here’s a visualisation of service orchestration in action:

In this example, service A is dependent on both service B and service C when handling its own inbound HTTP calls. Service A fetches state X (from service B) and state Y (from service C), aggregating them with some of its own state, Z, to finally yield an HTTP response to its callers. Failures in either B or C will prevent A from fully fulfilling its responsibilities – it may be able to degrade gracefully, but the ability to do so really depends on context.

Contrast this scenario with service choreography instead. In a system designed to embrace choreography, services will typically avoid synchronous coupling – i.e. any integration between services does not apply during the usual request/response cycle. In such cases, a service can fulfill its responsibilities within the request/response cycle without the need to make further calls to other services (with the exception of persistence backends owned solely by the service).

The classic way to achieve this is via embracing an event-driven (message passing) approach to integration. That is, any state that a service requires from services external to it, is projected internally from event streams published by those external services. Such internal projections will be managed and updated completely asynchronously (outside of the request/response cycle), and will be eventually consistent. In the true spirit of microservices, the service entirely encapsulates all the persistent state it requires in order to fulfill its responsibilities, achieving true isolation and autonomy.

Let’s refactor the previous diagram to reflect choreography instead of orchestration:

In this updated approach, service A receives events from streams published by service B and service C. Service A processes events in an eventually consistent manner, and persists locally only what it needs of state X and Y, alongside its own state, Z. When handling an incoming request, service A does not need to communicate with service B or service C.
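As a rough sketch of such a projection (the event types and fields here are entirely hypothetical), service A might maintain its local slice of external state in Python like this:

```python
class CustomerProjection:
    """Service A's local, eventually consistent copy of external state.

    Events from service B's stream are applied asynchronously, outside
    the request/response cycle; request handling then reads only this
    local store and never calls service B.
    """
    def __init__(self):
        self.names = {}

    def apply(self, event):
        # Keep only what this service actually needs of state X.
        if event["type"] in ("CustomerRegistered", "CustomerRenamed"):
            self.names[event["customer_id"]] = event["name"]

    def customer_name(self, customer_id):
        # Served entirely from local state within the request cycle.
        return self.names.get(customer_id)

projection = CustomerProjection()
projection.apply({"type": "CustomerRegistered", "customer_id": "c1", "name": "Ada"})
projection.apply({"type": "CustomerRenamed", "customer_id": "c1", "name": "Ada L."})
```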

Clarifying asynchronous integration in service choreography

I feel there’s some confusion when people refer to asynchronous communication, especially in the field of microservices integration. It’s worth taking some time to clarify what’s meant in the context of service choreography.

What it’s not:

  • Non-blocking I/O – I’m an advocate of asynchronous, non-blocking I/O as part of building more efficient, resilient and scalable services that interact in some way with external I/O. However, in the context of service choreography, this is certainly not what we mean by asynchronous integration. Non-blocking I/O could still be used within the request/response cycle for orchestration use cases, and, whilst it has its advantages in one sense, certainly doesn’t, on its own, buy any architectural benefits of isolation and autonomy.

  • Classic MQ Request/Reply – It’s possible using classic MQ technology (e.g. JMS, AMQP) to achieve asynchronous request/reply behaviour. You could pop a message on a queue, and wait for a response on some temporary reply queue. There’s certainly some added decoupling in that the caller needn’t know exactly who will reply, but, like with non-blocking I/O, if this is being done as part of a service handling an incoming request, then, despite the communication with the MQ itself being asynchronous in nature, the service is still not acting autonomously. If a consumer responsible for replying is down, and the call must then timeout, it’s ultimately no different to an HTTP endpoint being unavailable or failing.

So, to clarify things then. When we’re talking about services being integrated asynchronously as part of service choreography, we’re referring to a service being free from the need to rely on its dependencies during the request/response lifecycle.

End-to-end autonomy

Where I’ve covered isolation and autonomy in this post, I’ve been referring to runtime autonomy. It’s worth noting that a strong motivation for isolating services is the autonomy they additionally afford throughout the entire engineering process. The event-driven integration nature of service choreography sits very naturally with the desire to assign clear ownership to specific teams, enabling them to develop, build, test and release independently. Whilst there are techniques to support these things alongside service orchestration, I find it much easier to reason about isolation and independence when service dependencies are largely confined to the background.

When considering service/API level tests, embracing eventually consistent, event-driven integration allows you to focus on data fixtures as opposed to service virtualisation. There’s something inherently simpler – in my mind – about placing a system into a desired state via a simulated event stream, rather than having to worry about mocking/stubbing/virtualising some external service endpoint(s).

Added complexity?

Service choreography, without doubt, introduces a level of technical complexity beyond orchestration. Rather than the apparent simplicity of making calls to dependencies in real-time, you need to both produce events yourself and consume events from others; it requires a fundamental shift in the way you build software. It exposes you to such challenges as guaranteeing at-least-once-delivery/processing of events, handling out of order events, ensuring idempotency in event consumers, factoring in eventual consistency in the UI etc.

Like with anything in software engineering, it’s all about tradeoffs. There’s no silver bullet, and you have to make a call based on your own unique circumstances. There will always be times where the simplicity of orchestration will trump its limitations on autonomy. Making such calls is why experience matters and why there will always be room for judgement in engineering.

For example, a team may decide that it’s ok for services acting as companions to some primary service – and so invisible outside the context of the service’s public contract – to use orchestration, reserving choreography for integrating with services owned by other teams. In this scenario, the team would still benefit from the greater independence they’ll have from other teams, whilst losing some runtime autonomy internally.

Additionally, there are times when integrating with third-party services (outside your organisation) will necessitate a degree of orchestration given limitations in the third-party API contracts.

As a general system wide architectural constraint, it’s wise to be rigid about enforcing service choreography between services owned by different teams. This has the added benefit of really helping to drive home the importance of good service boundaries – if your design relies on orchestration between services owned by different teams, there’s a good chance you’ve not found sensible business-oriented boundaries between your services. A word of caution, though: where services are clearly aligned with business boundaries, choreography should be the preferred approach, even if the services are owned by the same team.

Touching on boundaries

Whilst service choreography enables autonomy, the elephant in the room is that a dependency is still a dependency, whether it be event-driven or not. If a service is required to consume from a large number of event streams, it’s still adding overhead in terms of managing the dependencies over time.

One of the fundamental mistakes people make with microservices is to draw technical boundaries rather than boundaries of business capability/purpose. The use of the word “micro” is surely partly to blame, as it may appear to encourage highly granular service responsibilities. By breaking up a monolith into services of a highly technical nature, with little correlation to the business domain, you’re inevitably going to introduce more dependencies, and accordingly more overhead. Too many dependencies of any variety are almost certainly a design smell, a sign of high coupling and low cohesion, when it’s the exact opposite we’re looking for. If you’re stuck with lots of dependencies, it’s a sure-fire sign that you’ve drawn your boundaries wrong. If you’ve managed to identify a genuine business capability, you’ll be surprised at its fairly natural properties of isolation and independence.

Domain-Driven Design equips us with the toolset – and mindset – to identify business-oriented boundaries, and there’s certainly reason to see a close relationship between microservice boundaries and Bounded Contexts as described in DDD. Whilst I don’t consider there to be a direct 1:1 mapping in every circumstance (a good subject for another post), there are definitely some parallels to draw.

One way to reason about the need to introduce a dependency on an external event stream is to pedantically question the purpose of having that dependency in the first place. You may begin to find that you’re tending to introduce such dependencies for the purpose of presenting data in a UI rather than for fulfilling specific business logic within your service boundary. In such cases, it can be preferable to rely on data aggregation/composition in the UI layer (where it’s generally a little more acceptable/inevitable to rely on orchestration). When applying Domain-Driven Design to model business complexity, it’s advisable to be ruthless about business purpose within any Bounded Context, and that means avoiding projecting state from external services if your service/domain doesn’t really need it.

Coming next

That wraps up part one of this three part series. I’ll be following up soon with part two within which I’ll cover the implementation traits required to satisfy the theory discussed in this first post. Stay tuned!

Twelve-factor Config and Docker


Config confusion

I recently wrote about what I see to be confusion over the way the Twelve-factor app guidelines are interpreted with regard to app config. You can read that post here.

To summarise my argument, I think people tend to focus solely on the explicit guidelines in the Config section, and overlook the additional advice – specific to config – given in the Build, release, run section. The former simply speaks about reading app config from the environment, the latter clearly states that a software ‘Release’ is a uniquely versioned combination of a ‘Build’ and ‘Config’.

Suffice to say, I’m quite surprised at the extent to which people manage to overlook the Build, release, run section when discussing Twelve-factor config. It offers extremely specific advice with regard to managing immutable release packages, and I don’t believe it’s correct to claim you’re doing Twelve-factor style config if you’re not also following the Build, release, run guidelines.

In this post, I want to address the implications of this view with regard to shipping applications in Docker containers. Once again, I see some conflict in what Twelve-factor has to say about config and perceived best practices for Docker.

The Docker way

I’ve digested a whole bunch of various opinions and best practices with regard to Docker, and a fairly consistent view is that containers should remain environment agnostic – i.e. the same container you generate at ‘Build’ time should be deployable to any environment.

I get this, and I’m in total agreement. There’s certainly agreement here with Twelve-factor, at least in terms of what constitutes a ‘Build’. So, how would we supposedly do Twelve-factor config with this model? It appears to be quite simple, as Docker lets us pass in environment variables when running a container, e.g.

docker run -e FOO=bar coolcompany/coolapp:1.0.0

This is saying, run coolapp with tag 1.0.0 passing in bar as the value for environment variable FOO. The Docker tag, in this fictitious example, is meant to represent the ‘Build’ version of the app, and would have been generated during the build phase in the delivery pipeline.

This approach is absolutely consistent with the Twelve-factor Config section – our application (encapsulated in the container) will read its configuration from the environment variable(s) provided. And, of course, we haven’t tied the container image to a specific environment – this container looks very much like what Twelve-factor refers to as a ‘Build’.

Hold on, though. Whilst we’ve satisfied the Config section, we’ve only partly satisfied the Build, release, run section. In fact, I’d go as far as saying that this is violating the Build, release, run guidelines.

Let’s take some quotes directly from the Twelve-factor guidelines:

The release stage takes the build produced by the build stage and combines it with the deploy’s current config. The resulting release contains both the build and the config and is ready for immediate execution in the execution environment.


Every release should always have a unique release ID, such as a timestamp of the release (such as 2011-04-06-20:32:17) or an incrementing number (such as v100). Releases are an append-only ledger and a release cannot be mutated once it is created. Any change must create a new release.

In our example above, I think it’s fair to say that this advice has been circumvented. We’ve taken our ‘Build’ and jumped straight to ‘Run’, altogether ignoring what Twelve-factor refers to as a ‘Release’. We’ve not created a uniquely versioned, immutable release package, and we’ve burdened the ‘Run’ phase with the additional responsibility of having to pass environment variables to the container. The ‘Run’ phase has become more complicated than it should be.

This approach has maintained a distinct separation between code and config, whereas Twelve-factor very explicitly specifies that a ‘Release’ is a combination of code and config. The Twelve-factor approach allows the ‘Run’ phase to be dumb – it just launches whatever package you give it, needing no knowledge of application specific configuration. And, it naturally follows that rollbacks are a simple case of running the previously versioned release, with no need to worry about what the configuration for that version should be.

An alternative approach

This is where this post is bound to get murky and upset a few people. I’m going to be heretical and suggest a model whereby we do create environment specific Docker containers. I can hear the cries of “How very dare he?!”

I propose the idea of taking our base ‘Build’ image and creating a uniquely versioned ‘Release’ image as a thin layer on top of it – oh the joys of image layering. This new image embeds the environment variables – specific to a chosen environment – within itself, rather than requiring that they be passed to docker run on launch.

Let’s look at an example Dockerfile to achieve this:

FROM coolcompany/coolapp:1.0.0
# Bake the environment specific config into the 'Release' layer
ENV FOO=bar

We can then use this Dockerfile to build the ‘Release’ image, giving it a unique version at the same time, e.g. coolcompany/coolapp:1.0.0-staging-v11.

I’ve made up a convention here – {build_version}-{environment_name}-{release_number} – for tagging releases. Including the environment name in the tag might be a nice way of ensuring it’s clear which environment the container is tied to.

So, our delivery pipeline continues to produce an environment agnostic container ‘Build’ image, but, just at the point of deployment to our chosen environment, we create a new environment specific image and use this as our ‘Release’. Then, the ‘Run’ phase need only be given the ‘Release’ image version in order to execute the application.
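To make the mechanics concrete, here’s a rough sketch of that deploy-time step – in Python, purely illustrative, reusing the fictitious coolapp image, FOO variable and my made-up tag convention from above:

```python
# Sketch of a deploy-time step that turns an environment agnostic
# 'Build' image into a uniquely versioned, environment specific
# 'Release' image. All names and the tag convention are illustrative.

def release_tag(build_version: str, environment: str, release_number: int) -> str:
    """Apply the {build_version}-{environment_name}-{release_number} convention."""
    return f"{build_version}-{environment}-v{release_number}"

def render_release_dockerfile(build_image: str, config: dict) -> str:
    """Generate the thin 'Release' layer that bakes config into the image."""
    lines = [f"FROM {build_image}"]
    lines += [f"ENV {key}={value}" for key, value in sorted(config.items())]
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    tag = release_tag("1.0.0", "staging", 11)
    dockerfile = render_release_dockerfile(
        "coolcompany/coolapp:1.0.0", {"FOO": "bar"}
    )
    print(tag)  # 1.0.0-staging-v11
    print(dockerfile)
    # A real pipeline would now write the Dockerfile to disk and run:
    #   docker build -t coolcompany/coolapp:1.0.0-staging-v11 .
```

The ‘Run’ phase then needs nothing beyond the resulting tag – no knowledge of the application’s configuration.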

This model sees ‘Release’ packages created on demand – i.e. a ‘Release’ package (Docker image) is created just in time at the point of deployment to a specified environment. Where the environment variables are actually sourced from, and how they make their way into the Dockerfile, is beyond the scope of this post.

The right way?

I’ve read enough so called best practices to expect this approach to anger some Docker/containerization purists. However, I genuinely see this as being a reasonable way to implement the Twelve-factor guidelines using Docker.

If not this approach, then what? For me, one reasonable way to challenge this model would be to challenge the whole Twelve-factor concept of Build, release, run. If you disagree with the Twelve-factor concept of a ‘Release’, then by all means disagree with the content of this post!

Just like with my related post – and despite being sympathetic to the Build, release, run advice – I’m not necessarily arguing right or wrong here. It’s just a case of pointing out what would constitute a pure implementation of the Twelve-factor guidelines on top of Docker.

Remember, the Twelve-factor guidelines were essentially invented by the Heroku gurus, and there are other PaaS technologies that also follow the same principles. It’s just a specific way of tackling release management, and, whilst it may not be the right way of using Docker, I don’t think it would be fair to say it’s wrong either.

What do you think?

Customer Experience Makes the Difference

| Comments

In this post, I’m going to dip my toes into the world of customer experience (CX). This isn’t a subject I’ve written about before, but a recent sub-optimal hotel experience during a trip to Belgium has prompted this analysis.

Whilst this story does not relate specifically to technology, great customer experience is something that all businesses should pursue, whether technical or not in nature.

Our story

Back in early December, my wife and I booked a two night city break to Bruges (Belgium), to return for the third time to what will remain an unnamed hotel, a hotel that continued to rank amongst our top five hotels in the world. To beat those New Year blues, we’d been looking to go away sooner, but delayed until the beginning of February due to the hotel having no availability for the entire month of January. With no specific explanation – within the online booking engine – for the complete lack of availability in January, it was natural to assume they were just fully booked.

A few days prior to our trip, my wife noticed on their website an explanation for the January blackout – turns out they were shutting the hotel for refurbishment. But, no worries, as the work would be over ready for our arrival.

On arrival at the hotel, it was immediately obvious that the work had not been completed – there were decorators inside and out, hammers banging, paint pots piled up in the hallways. During check-in, the desk staff made no reference to the refurbishment work that was going on around us, and we were eventually shown to our room.

To a backing track of relentless hammering outside the window, my wife asked our room escort about the Wellness Centre – the pool, jacuzzi, sauna etc. being one of the considerable advantages of this particular hotel. It was at this point that a member of staff finally acknowledged what was going on around us, breaking the news that the Wellness Centre was out of use. It turned out the refurbishment work had overrun (no s**t, Sherlock) and was expected to carry on for a further week.

Following a fairly short discussion, my wife and I concluded that we should move to a different hotel. Why?

The hotel failed at every possible opportunity to acknowledge the problem

The hotel had at least three clear opportunities to proactively acknowledge the problem:

  • At the point of booking – they could have at least warned of the potential risk of overrun
  • In the days leading up to our arrival – a simple email would have been courteous
  • On arrival – they could have immediately apologised at check-in and offered solutions

The most damning thing is that we ultimately had to prise an apology out of the staff. If I’d been managing the hotel, I’d have personally welcomed every arriving guest, explained the issue and discussed ways to minimise the disappointment for them. It’s a customer experience disaster that we were left to realise what was going on before the staff acknowledged it.

The hotel had devised no proactive mitigation plan

What’s pretty clear is that the hotel staff, as a team, had failed to prepare for the fallout. Let’s be honest, it’s not difficult to predict that refurbishment works will overrun – not only should the hotel manager have prepared a mitigation plan, but that plan should have been communicated clearly to every member of staff.

They should have been ready for every single pre-booked guest arrival and had devised an individual, tailored mitigation solution for each one. This plan could have been communicated to the guest(s), at best, prior to arrival, and, at worst, immediately on arrival at the reception desk.

A great manager would have got every member of staff into a room, rallied the troops and ensured each team member was fully prepared to execute the mitigation plan. Unfortunately, the staff at our hotel seemed as bemused as we did, and this was extremely disappointing.

What can be learned?

So, given the clumsy nature of their handling of the situation, we checked ourselves out of the hotel and found ourselves a room elsewhere (the first time we’ve ever done that). The hotel that we’d previously held so dear to our hearts had undone years of customer loyalty development in a single day. As it turns out, we shouldn’t have ignored the fact they’d tumbled down the TripAdvisor rankings in the past few years – the warning signs were there.

Still, there’s a good chance we’d have forgiven them if they’d demonstrated a desire to minimise our disappointment – we’d probably have stayed where we were. Instead, not only are we unlikely to ever go back, but we’re now unlikely to recommend anyone else to stay there either (and they’ve had guests in the past directly off the back of our recommendations).

The lessons that can be learned from this experience are hardly original:

A good product is worth little without great customer service

The tangible product – ignoring the fact that some of it wasn’t available at the time – was seemingly unchanged. This is still a fabulous looking boutique hotel, in a fabulous location. The rooms are beautiful and the breakfasts fantastic. But, that just isn’t enough without delightful customer service to go along with it.

Never ever ever think your tangible product will make up for lapses in customer service. There’s no difference between product and service – your product includes the service.

Don’t rest on your laurels

You can’t rely on your customers’ loyalty through past experience. Just because you’ve won their loyalty in the past, it doesn’t mean you can screw up in the present. Loyalty is a pretty fickle concept and, if you mess with your customers, they’ll look elsewhere for a better all round experience.

Customer service is the most important differentiator

As a market becomes more and more competitive, the opportunities to differentiate on your tangible product are reduced – it’s not always easy to create a truly original product. So, whilst your competitors continue to iterate blindly on their tangible product, spread your resources to accommodate improving all round customer experience. Service your customers in delightful ways, and they’ll keep on coming back to you for more. Do not underestimate your ability to differentiate yourself through great service.

Make the difference

As stated at the beginning of this post, customer experience is important for every business – little or large, technical or non-technical.

Great customer experience makes the difference, so make it your difference.

A Confusing Side to Twelve Factor App Configuration

| Comments

I’m sure I’m not the only one who breathed a sigh of relief when the clever brains behind Heroku published the Twelve Factor App guidelines. Here was a reliable set of principles – born out of real life experience – that immediately hit home, providing sensible advice for overcoming the many pitfalls developers and operations teams have fought relentless battles with time and time again.

One of the principles that I was especially able to connect with is the advice regarding from where applications should read their configuration. Having regularly encountered config file hell over the years, this simple, platform agnostic approach to supplying config to an application makes a whole lot of sense. There have, though, been some misunderstandings with regard to this advice, and this blog post by Kristian Glass does a good job of highlighting one such misunderstanding – Twelve Factor does not dictate from where the environment should be populated, only that an app should read from it.

So, with a major misunderstanding out of the way, we are left with a whole bunch of options as to how we populate the environment. Outside of the extreme abstraction of a PaaS environment like Heroku, something like Consul – the distributed key/value store – is one such solution. And in this blog post by Hashicorp, that very solution is covered quite nicely.

But, hang on, has something been overlooked?

I’d like to challenge Hashicorp’s claim that this approach is compatible with Twelve Factor principles. My personal opinion – on which I’m very happy to be challenged – is that there is a further critical piece of advice, regarding configuration, contained within the Twelve Factor guidelines – you just have to turn to a different page.

This additional piece of advice is contained within the section titled Build, release, run. As far as I’m concerned, the advice here is pretty crystal clear – a ‘Release’ is a combination of a ‘Build’ and ‘Config’. This immutable artifact is uniquely versioned and is deployed and rolled back as an atomic unit. This is how Heroku does it, and the open source Docker based PaaS Deis takes the same approach. I’m leaving out other PaaS tech that also follows this model.

It would be hard to deny that this approach makes deployments very easy to reason about. By bundling together code alongside environment specific config, you simplify release management and tracking. For example, if you need to rollback, you don’t have to manage that as two separate actions – it’s an atomic action and will return you to the last known uniquely versioned, operationally sound combination of code and config.

So, whilst it’s true that Twelve Factor does not dictate from where the environment should be populated, it’s pretty clear that it does say when it should be read: the source of environment configuration should be read when preparing a release, and any change to config, just like a change to code, results in a new release.

Therefore, any approach to managing configuration in a similar way to Hashicorp’s advice would not appear to be compatible with the Twelve Factor guidelines. They are promoting a model where the ‘Build’ and ‘Config’ are not atomically bundled as a ‘Release’ and, as such, this model violates a fundamental Twelve Factor principle – if your code and config are managed in separate lifecycles, it ain’t Twelve Factor compatible.

One could quite easily challenge the ‘Build + Config = Release’ advice:

  1. It doesn’t appear to leave any room for runtime config changes
  2. It’s not completely clear how it would work with dynamic service discovery

Sometimes, however, the advantages of predictable, easy to reason about deployments outweigh the benefits of such niceties.

I’m not debating here what’s the right or wrong way, I’m just pointing out that the Twelve Factor advice is very clear about the meaning of a ‘Release’, and, therefore, any method that circumvents this, cannot claim to be compatible with the guidelines.

Just saying.

UPDATE 2015-01-21 20:10

Spring are also claiming that the Spring Cloud approach to configuration is Twelve Factor compatible in this blog post. I think this is making the same mistake as Hashicorp. If configuration is able to change independently from a release – as is described in this post – then it’s not in the spirit of the Twelve Factor App.

Lessons From My First Startup Failure

| Comments

It’s been a while since I last posted here, and with good reason – I’ve been trying to run a startup! As this is a technically focused blog – and given my startup experiences were mainly commercially focused – I didn’t have all that much to say here. But, now that I’ve experienced my first startup failure, I thought I’d write about it here (even though it’s not really technical content).

It has become somewhat customary for entrepreneurs to write about the good times and the bad. In either case – often even more relevant in the case of the bad times – it seems quite useful to share with others the lessons one learnt along the way. So, this is me sharing some of the lessons I learnt during my first startup failure experience. Let’s get on with it.

Be wary of complex dependencies

If your business model is reliant on others (partners) investing in changes to their own technology systems, your chances of success are considerably lowered.

All startups have an implicit dependency on customers buying their product/service. Any additional dependencies just hugely complicate things. Try to keep as much within your own control as possible.

Take good feedback with a pinch of salt

Most people will tell you how great your idea is – it’s human nature. The only way to be sure you’ve got a good product is by getting someone to actually pay you. Until someone actually hands over money for your product/service, you haven’t validated the problem you are solving (this is skewed somewhat towards B2B software).

Identifying a problem is not enough

Whilst clearly important, it’s not enough to just identify a problem. Yes, your business may well solve a real pain point in theory, but it still doesn’t mean you can actually execute on it successfully. Commercial complexities (e.g. a deeply tangled industry) can still make even a good idea impossible to actually execute on, especially for a startup.

Be patient before diving in

Even if you can afford to, it might not be a good idea to quit your job until you’ve found something you are really passionate about building/solving. Identify a problem you’ve had in your own life/job, don’t try to force ideas out of nothing – it’s contrived. You have to be patient and let the idea come to you – if that means working for others until you do, then so be it.

Make the most of working for larger companies where you can network with those in business areas you are not familiar with. The gaps will be there, you just need to put yourself in the position to see them.

It’s much harder than you could possibly imagine

It’s going to be infinitely more difficult to succeed than you think. Success is a combination of a good idea, impeccable execution, luck, who you know etc.

There is a myth doing the rounds that it’s easier than ever to start a startup. In reality, it’s just a cultural mindset shift; the actual chances of succeeding are just as low as they have ever been. The ‘anyone can build an app’ delusion is really unhelpful.

And, it will take five times longer to succeed than you probably think it will. Most successful startups have been operating a lot longer than you think they have – you’re facing many years of blood, sweat and tears (and you still might fail).

Be willing to acknowledge, therefore, that running your own business might look much better on paper than it does in reality.

A tech startup is not just about tech

Marketing and sales will most likely be 80% of the effort in a startup’s success story. Don’t be under any illusions – a good product won’t sell itself. A warning for you techies out there – don’t be fooled into thinking that you only need tech skills to get a business off the ground. It’s not true, seriously!

Accordingly, non-techie entrepreneurs shouldn’t feel disadvantaged that they don’t have tech skills. This is not to say that tech isn’t hugely important (I obviously think it is), but it’s not a one-way ticket to success.

Recognise when the game is up

Be prepared to recognise when it’s time to call it a day. Cut through the false positives in order to make an objective assessment of your business. There’s a fine line between sensible persistence and blind optimism. It’s not always advisable to keep the faith if every indicator of business healthiness is against you. Maybe it’s just not a good idea, or maybe the market is just not ready, for any number of commercial reasons, to accommodate your product.

Don’t do it for money

Don’t bother doing it if you are doing so in any way for money. Be totally honest with yourself from the outset: unless your only reason for starting a business is to build a great product/service, don’t bother. For most people, if they are truly honest, the idea of becoming rich is the real motivator deep down.

What a Storming Idea

| Comments

Firstly, an important disclosure – having only just stumbled across the practice of Event Storming, I’ve not yet had the opportunity to experiment with it myself. However, sometimes a concept/practice makes such immediate sense in one’s mind, that one feels compelled to talk about it!

This post feels like quite a natural successor to my previous post on Event-driven Architecture. In that post, I discussed some of the tangible benefits of EDA, and this follow up introduces the practice of Event Storming, a fledgling Domain Driven Design influenced practice that goes hand in hand with EDA to assist product teams in exploring the complexity of their business domain.

It’s not my intention to write a long article on Event Storming – I encourage you to read Alberto Brandolini’s introduction instead – but I want to share my biggest takeaways from this new learning.

The Domain-Relational Impedance Mismatch

In my experience of domain modelling, the biggest mistake I’ve seen made, time and time again, is to fail to involve the most important people ‘in the room’. The reason for this is that far too many projects take a persistence oriented approach to modelling, as opposed to a business oriented approach. Put simply, by tending to address model genesis using only technical team members, teams get sucked into allowing technical persistence concerns to shape their modelling approach. It’s not unusual for domain experts to be involved only at the user interface level, leaving software engineers and DBAs to make decisions that they are probably the least qualified to be making. The ultimate success of a software project resides on how well domain complexity has been tackled, and so it seems crazy for domain experts to be absent from the modelling process.

Let’s be absolutely clear – a domain model exists in spite of persistence technology, not because of it. A model is not a technical concept; it is a reflection of a business domain that exists in real-life, not inside a machine.

“A domain model exists in spite of persistence technology, not because of it.”

Whilst NoSQL technology is becoming more popular, it’s still fairly normal to see domain modelling tackled using entity relationship (ER) diagrams – something quite familiar to engineers and DBAs. That wily old fox, the relational model, is still recognised by many as the de facto way to practice domain modelling. However, Domain Driven Design (DDD) teaches us a much better way, and does not make room for persistence concerns in our conversations – models spawned using DDD practices typically appear very different to what they’d look like had ER modelling been applied instead.

You’re probably familiar with the object-relational impedance mismatch concept, but I think our problems extend much further than that; I believe DDD teaches us of a domain-relational impedance mismatch. That is, the relational model is not a natural fit for addressing domain model complexity, and thus should not be trusted to do so.

The light at the end of the tunnel

So, we know we’re doing it wrong, but how do we then ensure we get both the right people in the room (domain experts), and a method to support effective communication of domain complexity? This is where I believe we should look to Event Storming to help us out.

From my initial learnings on Event Storming, I can wholeheartedly say that this technique appears to offer a very attractive way to ensure focus remains on the business rather than technical implementation. It forces the right people – domain experts – to be in the room, thus ensuring core business flows are identified, bounded contexts are defined and consistency boundaries are clarified.

Event Storming does imply an event-driven architecture (EDA), but I hope my previous post serves to address why you should be doing that anyway. It finally gives us an accessible technique that allows domain experts and technical specialists to work together to tackle domain complexity effectively. It’s a really exciting prospect and I look forward to applying it both in existing and future projects.

The Silicon Valley Mindset

| Comments

As I’m currently on a temporary work assignment in San Jose, I recently wrote an article for The Croydon Citizen about my experiences in Silicon Valley. It’s great to have become part of the Croydon Tech City movement, and so it was nice to contribute an article in support of the great work being done to promote Croydon as a new hub for tech startups in the UK.

Check out my article here:

Event-driven Architecture FTW

| Comments

A quick primer on EDA

First of all, let’s delegate to the all-knowing source, Wikipedia, to give us a concise definition of Event-driven architecture (EDA):

“Event-driven architecture (EDA) is a software architecture pattern promoting the production, detection, consumption of, and reaction to events.”

It’s pretty unusual to encounter a software engineer who hasn’t dabbled with publish-subscribe semantics – or anything resembling the Observer Pattern – at some point in their engineering adventures. It’s hard to ignore, when faced with a challenge to simplify some programming problem, the draw of decoupling a producer of something from one or many consumers of that something. Not only does it divide responsibility (making code easier to test for starters), it also clears the way for an asynchronous programming model in use cases where producers needn’t be blocked by the actions of an interested consumer.

If you can appreciate the power of publish-subscribe, you’re unlikely to have much trouble in figuring out how Event-driven architecture (EDA) could help you. It only requires a small leap of faith to realise that pattern you used to solve that micro problem, in some isolated component within your system, can become a macro pattern to underpin a system-wide architectural style.

So, let us return to that Wikipedia definition. I think we could reasonably interpret that definition to understand EDA as the art of designing a system around the principle of using events – concise descriptions of state changes or significant occurrences in the system – to drive behaviour both within and between the applications that make up an entire platform.


Production

Production of an event involves some application component/object generating a representation of something that has happened – maybe a state change to an entity, or maybe the occurrence of some activity (e.g. a user viewed page X). Rather than notifying specific consumers, the producer will simply pass the event to some infrastructural component that will deal with ensuring the event can be consumed by anything that’s interested in it.


Detection

It’s my belief that ‘detection’, in the Wikipedia definition, refers to the mechanism that sits between a producer and consumer (the infrastructure piece) – the logic that ensures events are passed to interested consumers.


Consumption

Consumption is the act of an interested consumer receiving an event. This is still most likely part of the infrastructure piece (e.g. a message bus). There might be a whole bunch of concerns here around reliable delivery, failure recovery etc.


Reaction

Reaction is the fun part where a consumer actually performs some action in response to the event that it has consumed. For example, imagine you register a user on your website and you want to send them a welcome email. Rather than bundling the responsibility for sending email within your domain model, just create a consumer to listen for a UserRegisteredEvent and send an email from there. This nicely decouples the email delivery phase and also allows it to be done asynchronously – you probably don’t need, or want, email delivery to be a synchronous action. Also, imagine you have further requirements relating to post registration behaviour – your domain model would soon become unwieldy with all that extra responsibility. Not one to want to violate the Single Responsibility Principle (SRP), you sensibly use event-driven programming to separate all those actions into separate consumers, allowing each behaviour to be tested in isolation, and retain simplicity in your domain model.
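As a minimal sketch of this style (hypothetical names throughout – a real system would put a message broker between producer and consumers rather than an in-process dictionary):

```python
from collections import defaultdict
from dataclasses import dataclass

# A tiny in-process event bus. Real systems would use a message bus
# here, but the producer/consumer decoupling is the same shape.
_subscribers = defaultdict(list)

def subscribe(event_type, handler):
    _subscribers[event_type].append(handler)

def publish(event):
    for handler in _subscribers[type(event)]:
        handler(event)

@dataclass
class UserRegisteredEvent:
    user_id: str
    email: str

sent_emails = []

def send_welcome_email(event: UserRegisteredEvent):
    # Stand-in for a real email delivery call.
    sent_emails.append(f"Welcome, {event.email}!")

subscribe(UserRegisteredEvent, send_welcome_email)

# The domain model just publishes the event; it knows nothing about
# email delivery (or any other consumer that might be listening).
publish(UserRegisteredEvent(user_id="42", email="jane@example.com"))
```

Adding a further post-registration behaviour (analytics, say) is just another subscribe call – the domain model is untouched, which is the SRP point above.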

Events everywhere

As previously alluded to, it’s far from unusual to see fragments of event-driven programming in many applications. However, it’s another step entirely to see event-driven programming adopted in such a way that it becomes an endemic architectural pattern – that is, where an entire platform uses events to underpin all its moving parts. To be successful with EDA, it needs to become a fundamental mindset that drives all design decisions, rather than just a pattern used in some isolated parts of a wider system.

I want to evaluate a series of application/platform features that, whilst they may sit outside of core business workflows, fit really nicely with EDA. This should help to show why EDA can be a very fruitful path to follow.

WebHooks (the Evented Web)

WebHooks is a fairly high level concept that encompasses the use of HTTP as a form of publish-subscribe mechanism. Whilst there are more recent attempts to create more standards around this – calling it the Evented Web – the fundamental idea is to allow a consumer to register a callback URL with some remote resource (using HTTP) that will be invoked by the hosting service whenever some event occurs relating to that resource. A really well known example is post-commit hooks on Github – any external tool (e.g. a CI server, a bug tracker) can register interest in commits to a repository and react in whatever way that makes sense in their context.
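To illustrate the shape of the idea, here’s a small in-memory sketch (hypothetical API – real implementations such as Github’s deliver signed HTTP POSTs, which this simulates by recording the outgoing requests):

```python
import json

# A WebHooks registry: consumers register a callback URL against an
# event name, and the host 'delivers' each event to every registered
# URL. Delivery is simulated by recording what would be POSTed.
hooks = {}       # event name -> list of callback URLs
deliveries = []  # (url, json payload) pairs a real host would POST

def register_hook(event_name, callback_url):
    hooks.setdefault(event_name, []).append(callback_url)

def fire(event_name, payload):
    for url in hooks.get(event_name, []):
        deliveries.append((url, json.dumps(payload)))

# A CI server and a bug tracker both register interest in commits.
register_hook("push", "https://ci.example.com/hooks/build")
register_hook("push", "https://bugs.example.com/hooks/commits")
fire("push", {"repo": "coolcompany/coolapp", "commit": "abc123"})
```

Notice the host needs no knowledge of what each consumer does with the event – the same decoupling as in-process publish-subscribe, stretched over HTTP.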

I’m pretty convinced that the Evented Web paradigm has got a lot of growth potential and will become a de facto expectation of any well designed service API. What should be clear is how easy it would be to add WebHooks functionality to your own application if you are already applying EDA across the board.

Big Data

I do kind of detest using the term ‘Big Data’ as it’s uncomfortably ambiguous and vague. However, for the purposes of this article, I’m going to stick with it (strike me down). If, for now, we think of Big Data as a way of capturing shed loads of data to enable business intelligence, we should be able to see quickly that events occurring within an application might be a great source of all that lovely data. If you’ve adopted EDA, your Big Data pipeline may just be a consumer of all your application events. You might dump all those events into HDFS for future batch processing, and, given you are essentially subscribing to a real-time event feed, you might also drive your real-time analytics use cases as well.


Monitoring

Unless you are someone who really couldn’t give a damn, you’re going to want some monitoring tools in place to give a thorough insight into the health of your system in production. Common monitoring solutions may include, amongst other things, a bunch of smart dashboards full of sexy graphs, and some threshold alerting tools to help spot potential problems as soon as an incident starts. Either way, both of these tools are driven by time series data that represents ‘stuff’ happening within an application. What better way to capture this ‘stuff’ than sucking events in from your event streams? Again, if you’ve already followed EDA, you’re going to get some pretty quick monitoring wins.
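A sketch of the kind of consumer this implies (names invented – a real setup would ship the data points off to something like StatsD or Graphite):

```python
import time
from collections import Counter

# A monitoring consumer that turns the application's event stream
# into time series data: one running counter per event type, each
# observation stamped with the time it arrived.
counters = Counter()
datapoints = []  # (timestamp, event name, running count)

def on_event(event_name):
    counters[event_name] += 1
    datapoints.append((int(time.time()), event_name, counters[event_name]))

# Simulate a few events arriving from the stream.
for name in ["order.placed", "order.placed", "payment.failed"]:
    on_event(name)
```

Dashboards graph the data points; alerting is just a threshold check over the same counters.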

Log file analysis

There is possibly some crossover here with monitoring, and even Big Data, but I think it deserves its own special mention. If you imagine logs as comprehensive streams of events, assuming you’ve followed an EDA style, you can pretty much get log analysis for free. Just suck your logs into some analysis tools (e.g. Logstash and Kibana), and you’re pretty much good to go. Just remember that it’s perfectly reasonable to use events to represent errors too (which could contain any relevant stack trace).

Test-driven development (TDD)

Okay, so TDD is not an application feature, it’s part of the engineering process. However, if our architecture decisions can help to improve our quality process, then that can’t be a bad thing. Event-driven programming encourages a code level design approach that follows the Tell, Don’t Ask pattern. You tell an object to do something, which leads to an event, or events, being published. So, what’s this got to do with TDD? In my experience, it’s much easier to reason about your code, and define more coherent specifications, if your testing style mimics ‘given this input (command), I expect this event, or events, to be produced’. A test first approach is very compatible with this style, and makes you think in a much more behavioural (think BDD) way.
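Here’s a tiny sketch of that testing style, using a hypothetical `Account` aggregate. The test reads as a specification: given these commands, expect these events:

```python
class Account:
    """Tiny Tell, Don't Ask aggregate: you tell it to do something
    (a command), and it records the resulting events."""

    def __init__(self):
        self.events = []
        self.balance = 0

    def deposit(self, amount):   # command
        self.balance += amount
        self.events.append(("deposited", amount))

    def withdraw(self, amount):  # command
        if amount > self.balance:
            self.events.append(("withdrawal_rejected", amount))
            return
        self.balance -= amount
        self.events.append(("withdrawn", amount))

# Specification: given a deposit of 100 and a withdrawal of 150,
# expect a 'deposited' event followed by a 'withdrawal_rejected' event.
account = Account()
account.deposit(100)
account.withdraw(150)
assert account.events == [("deposited", 100), ("withdrawal_rejected", 150)]
```

Notice the test never asks the object about its internals beyond the events it produced – the behavioural contract is the event sequence itself.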

For the win

Right, we’ve covered a primer of EDA and seen how it can be used to drive core business flows, cross-cutting concerns, and even our quality process. I believe this knowledge makes a very compelling case for adoption of EDA – why would you bake in custom solutions for capturing Big Data, doing health check monitoring etc., when you can simply piggyback these features off of your core architecture? Hopefully, you wouldn’t! All sorts of wonderful acronyms pop into my head at this point – KISS, DRY, SRP etc. And don’t we all love acronyms?

But can we go even further?

Going the whole hog with event sourcing

This discussion leads so elegantly into the final part of this blog post – event sourcing. Event sourcing is an approach to persistence that means the state of an entity – more specifically, an Aggregate Root in DDD speak – is made up of all the events, representing state changes, it has emitted over time. So, rather than store current state in a database, you simply load the historical sequence of events (from an event store) and apply them in order to obtain current state. I will leave it up to the reader to pursue the full benefits of using event sourcing, but here are some of the headline wins:

  • Supports a simple and very scalable approach to persistence. An event store can be as simple as an append-only log file.
  • Gives you a full history of every state change, which is great for producing an audit log (something you might want anyway, even without event sourcing).
  • Can still utilise snapshots of current state as a performance optimisation when replaying.
  • Very compatible with a test-first, behavioural approach to testing.
  • Plays very nicely with the CQRS architectural pattern, a very practical way to bake scalability into your applications by maintaining separate paths for reads and writes.
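The core replay mechanic can be sketched in a few lines. The `ShoppingCart` aggregate, its event shapes, and the list standing in for an event store are all illustrative assumptions rather than any particular event store’s API:

```python
class ShoppingCart:
    """Event-sourced aggregate: current state is derived purely by
    applying the historical event sequence, in order."""

    def __init__(self):
        self.items = {}

    def apply(self, event):
        kind, item = event
        if kind == "item_added":
            self.items[item] = self.items.get(item, 0) + 1
        elif kind == "item_removed":
            self.items[item] -= 1
            if self.items[item] == 0:
                del self.items[item]

def replay(events):
    """Rebuild current state from the event store (an ordered log)."""
    cart = ShoppingCart()
    for event in events:
        cart.apply(event)
    return cart

event_store = [("item_added", "book"), ("item_added", "book"),
               ("item_removed", "book"), ("item_added", "pen")]
cart = replay(event_store)
print(cart.items)  # {'book': 1, 'pen': 1}
```

A snapshot optimisation would simply mean starting `replay` from a saved state and applying only the events recorded after that snapshot.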

If you’re going to go down the EDA route, why limit just your applications to an event-driven style? If it’s possible to maintain state via the events you’re already publishing, why maintain a separate database at all? Storing events for all time might seem like a storage burden but, seriously, are we still worrying about the cost of storage in 2013? Storing current state in a database is a ‘lossy’ approach – once you’ve overwritten existing state, you can never get it back. Martin Thompson summed all this up so concisely in a recent tweet:

There are way too many compelling reasons for wanting to keep a history of everything and it’s impossible to avoid being courted by that proposition.

I think this is a really fascinating area of exploration – sometimes traditional CRUD might be a better choice, but the more I work with event sourcing, and the more comfortable I feel with EDA in general, the harder it becomes to find good reasons against following this path.


So that wraps up a fairly lengthy discussion on EDA and how an event-driven mindset can promote a coherent strategy for the way you build software. One of the toughest things we face as software engineers is maintaining a consistency in style, especially in large teams. My ideal vision is for code to speak the architecture patterns on which it is crafted, such that no engineer could ever doubt what is expected of them when refactoring or adding new features. For me, EDA is an enabler of this vision, and will help to bridge the gap between doing the right thing (building product features your users love), and doing the thing right (consistent and elegant technical solutions).

The Fallacy of the Omniscient Domain Model

| Comments

Complexity. As software engineers, it’s pretty hard to make it beyond lunch without someone mentioning it. But, what is it exactly? Most of us probably think we sussed this out a long time ago – hell, we’re probably preaching the KISS and DRY principles on a daily basis. However, complexity in software is a multi-faceted beast and, with that in mind, if you take the time to reflect on your own view of complexity, you may uncover a bunch of defective preconceptions.

One of these preconceptions, one which I’ve failed to question effectively myself in the past, is that a ‘simple’ domain model is one that can encompass the entire domain of an enterprise. The other preconception is that ‘simple’ architecture means few moving parts.

Let’s address each one individually.

Imagine a completely DRY, one-size-fits-all domain model that manages to perfectly model the domain of an entire enterprise. This model is infinitely malleable and able to accommodate future changes without adversely affecting existing code dependent on it. Are you struggling to imagine this? I hope so, because I don’t think it’s possible in anything other than the most basic of business domains. Regardless, this is a very common approach and, instead of being simple, it adds an alarming amount of overall complexity. Ultimately, the intertwining of vaguely related entities makes it impossible to make changes in one place without having to untangle deep dependencies elsewhere.

If, instead, you apply some of the core principles of Domain Driven Design (Domains, Subdomains, Bounded Contexts) to any enterprise, it’s natural to materialise multiple models that exist within the different subdomains and contexts that make up the wider domain. This approach, which actually reflects the structure of the real business, reduces overall complexity – dependencies are untangled by design, making changes more achievable in isolation.

One important thing to note here – you’re not necessarily violating the DRY principle if an entity appears in multiple contexts. Maybe you’ve just failed to tweak the Ubiquitous Language for each context, so you haven’t yet recognised that the entity means different things in those different contexts.
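As a hypothetical illustration, a “Product” might legitimately exist in two bounded contexts, with each model carrying only what its own context cares about. These class names and fields are invented for the example:

```python
from dataclasses import dataclass

@dataclass
class CatalogueProduct:
    """Catalogue context: what shoppers browse."""
    sku: str
    name: str
    description: str

@dataclass
class WarehouseProduct:
    """Warehouse context: what pickers and stock control need."""
    sku: str
    shelf_location: str
    stock_level: int

# The shared identity (sku) links the models across contexts without
# forcing a single, tangled, enterprise-wide Product class.
listing = CatalogueProduct("SKU-1", "Fountain pen", "A rather nice pen")
stock = WarehouseProduct("SKU-1", "Aisle 3, Shelf B", 12)
```

Two small, cohesive models are not repetition – they’re two different concepts that happen to share a name and an identifier.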

So, now onto our second preconception – does simple architecture mean few moving parts? My previous argument may seem overly convenient given it implicitly provides support to my next case – it’s most likely true that the introduction of multiple bounded contexts and/or subdomains will lead to more moving parts. But does that actually equate to complexity?

We’ve all seen over-engineered software that throws in a message queue here, another message queue over there, neither appearing to offer any discernible value. I’m certainly not advocating that! But these over-imaginative solutions shouldn’t be confused with DDD-influenced design decisions to separate contexts and integrate them effectively where necessary.

Applying the KISS principle to architecture doesn’t necessarily mean a system with fewer moving parts. A simple system is one that is highly adaptable and reactive to business changes – thus, the number of moving parts alone can’t be considered a good measure of complexity.

I hope the arguments I’ve made in this article help to address some common misconceptions. I do believe the concept of a simple, single, all-knowing domain model is a fallacy. Feel confident to apply DDD principles, be proud of your separate cohesive models, and don’t fear the additional moving parts you might adopt in the process.

Complexity is not always what it seems.