A microservices implementation retrospective

Over the last year at realestate.com.au (REA), I worked on two integration projects that involved synchronising data between large, third-party applications. We implemented the synchronisation functionality using microservices. Our team, along with many others at REA, chose a microservice architecture to avoid the problems associated with the “tightly coupled monolith” anti-pattern, and to make services that are easy to maintain, reuse and even rewrite.

Our design used microservices in three different roles:

  1. Stable interfaces – in front of each application we put a service that exposed a RESTful API for the underlying domain objects. This minimised the amount of coupling between the internals of the application and the other services.
  2. Event feeds – each “change” to the domain objects that we cared about within the third party applications was exposed by an event feed service.
  3. Synchronisers – the “sync” services ran at regular intervals, reading from the event feeds, then using the stable interfaces to make the appropriate data updates.

[Diagram: Integration Microservices Design]
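To make the three roles concrete, here is a minimal sketch of a single synchroniser pass, assuming a hypothetical escalation event feed and a stable interface in front of the target application (the endpoints and field names are illustrative, not our actual services):

```ruby
require 'net/http'
require 'json'
require 'uri'

# Hypothetical endpoints -- the real services and paths were specific to REA.
EVENT_FEED = URI('http://event-feed.example.com/events/escalations')
STABLE_API = 'http://stable-interface.example.com/enquiries'

# One pass of a "sync" service: read new events from the feed, then use the
# stable interface to apply the corresponding update.
def sync_pass(last_seen_id)
  events = JSON.parse(Net::HTTP.get(EVENT_FEED))['events']
  events.select { |event| event['id'] > last_seen_id }.each do |event|
    uri = URI("#{STABLE_API}/#{event['enquiry_id']}")
    request = Net::HTTP::Put.new(uri, 'Content-Type' => 'application/json')
    request.body = { status: 'escalated' }.to_json # PUT keeps the update idempotent
    Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
    last_seen_id = event['id']
  end
  last_seen_id
end
```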

Things that worked well

  1. Using a template project to get started

At REA we maintain a template project called “Stencil” which has the bare bones of a microservice, with optional features such as a database connection or a triggered task. It is immediately deployable, so a simple new service can be created and deployed within a few hours.

  2. Making our services resilient

We started “lean” with synchronous tasks that were triggered by hitting an endpoint on the service. One of the downsides of splitting code into separate services is an increased likelihood of errors due to network gremlins, timeouts and third-party systems going down. Failure is always an option. In our synchronous, single-try world, the number of unnecessary alerts that required manual intervention just to kick off the process again was a drain on our time. So we changed all our services to use background jobs with retries, and revelled in the relative calm.
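As a rough illustration of the shape this took (Sidekiq is shown here purely as an example of a background job library with retries; the class names are made up):

```ruby
require 'sidekiq'

class SyncEventJob
  include Sidekiq::Worker
  # Retries with backoff mean that transient network failures and third-party
  # outages usually resolve themselves without anyone being paged.
  sidekiq_options retry: 10

  def perform(event_id)
    EventProcessor.process(event_id) # hypothetical class that does the sync work
  end
end

# Triggering becomes "enqueue and walk away" instead of a synchronous call:
# SyncEventJob.perform_async(42)
```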

  3. Making calls idempotent

Given that we had built in retries, each piece of retryable code needed to be idempotent so that a retry would not corrupt our data. Using PUT and PATCH is great for this, but sometimes we did have to do a GET and make a check before making the next request.
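For example, here is a sketch of the GET-and-check pattern for an operation that is not naturally idempotent (appending a note is a made-up example, and the endpoints are illustrative):

```ruby
require 'net/http'
require 'json'
require 'uri'

# POSTing the same note twice would duplicate it, so check whether it already
# exists before creating it. A retry after a timeout then does no harm.
def add_note_once(enquiry_url, note_text)
  enquiry = JSON.parse(Net::HTTP.get(URI(enquiry_url)))
  return if enquiry['notes'].any? { |note| note['text'] == note_text }

  Net::HTTP.post(URI("#{enquiry_url}/notes"),
                 { text: note_text }.to_json,
                 'Content-Type' => 'application/json')
end
```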

  4. Using consumer driven contract testing

Testing data flows involving four microservices, two third-party applications and triggered jobs using traditional integration tests would have been a nightmare. Instead, we used Pact, an open source “consumer driven contracts” gem developed by one of REA’s own teams, to test the interactions between our services. This gave us confidence to deploy to production knowing our services would talk to each other correctly, without the overhead of maintaining integration tests.
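For a flavour of what the consumer side of a Pact test looks like (the service names and paths are invented for this example):

```ruby
require 'pact/consumer/rspec'
require 'net/http'
require 'json'
require 'uri'

Pact.service_consumer 'Synchroniser' do
  has_pact_with 'Event Feed' do
    mock_service :event_feed do
      port 1234
    end
  end
end

describe 'fetching escalation events', pact: true do
  it 'returns the events' do
    event_feed
      .given('an escalation event exists')
      .upon_receiving('a request for escalation events')
      .with(method: :get, path: '/events/escalations')
      .will_respond_with(
        status: 200,
        headers: { 'Content-Type' => 'application/json' },
        body: { 'events' => [{ 'type' => 'escalation' }] }
      )

    # The consumer code runs against the mock service; the recorded pact file
    # is later replayed against the real provider to verify the contract.
    response = JSON.parse(Net::HTTP.get(URI('http://localhost:1234/events/escalations')))
    expect(response['events'].first['type']).to eq 'escalation'
  end
end
```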

  5. Where possible, exposing meaningful business events, not raw data changes

It took a while to really grasp the meaning of this, but once we “got” it, it made sense. It is probably easiest to explain by example. One domain object had a “probability” percentage field that could be changed directly by a user. Instead of exposing “probability field changed” as an event, we exposed “escalations”. This hid the actual implementation of how the “rise in probability” was executed in the system, and meant we were not asking every other system that inspected the event feed to re-implement the logic of “the new value of probability is greater than the old value, therefore the likelihood has increased”.
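To illustrate with made-up payloads, compare what we could have exposed with what we exposed instead:

```ruby
# A raw data change forces every consumer to re-derive its meaning:
raw_change = {
  type: 'probability_field_changed',
  old_value: 40,
  new_value: 60
}

# The business event carries the meaning itself, and hides how the rise in
# probability was executed in the source system:
business_event = {
  type: 'escalation',
  occurred_at: '2014-09-01T10:00:00Z'
}
```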

  6. Automating all the things

We are lucky enough to be able to use AWS for all our development, test and production environments. We used continuous deployment to our development environment, and we had a script to deploy the entire suite of microservices to the test environment in one click. This made the workflow painless, and helped counteract the overhead of having so many codebases.

  7. Using HAL and the HAL Browser

HAL is a lightweight JSON (and XML, if you are so inclined) standard for exposing and navigating links between HTTP resources. The “Stencil” app comes with Mike Kelly’s HAL Browser already included (this is just an HTML page that lets you navigate through the HAL responses like a web browser). As well as the resources for the business functionality, we created simple endpoints that exposed debugging and diagnostic data, such as “status of connection to dependencies” or “last processed event”, and included links to them in the index resource. This made finding information about the state of the service trivially easy, even for someone who didn’t have much prior knowledge of the application.
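As a minimal sketch of the idea (a Sinatra-style index resource with invented diagnostic links; the real Stencil code differs, but the shape is similar):

```ruby
require 'sinatra'
require 'json'

get '/' do
  content_type 'application/hal+json'
  {
    _links: {
      self: { href: '/' },
      # Linking diagnostic resources from the index makes the state of the
      # service discoverable by clicking through the HAL Browser.
      'diagnostics:dependency-status' => { href: '/diagnostics/dependency_status' },
      'diagnostics:last-processed-event' => { href: '/diagnostics/last_processed_event' }
    }
  }.to_json
end
```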

Things that didn’t work well

  1. Coming up with a way to easily share code between projects

Our first microservices implementation used the strict rule of “one service has one endpoint”. This produced services that could be deployed separately without affecting or holding up other development work. However, it also increased the maintenance overhead, as each new service was made from a copy of the previous one and then modified to suit. When a problem was found in the design of one of them, or we wanted to add a new feature, we had to go and change the same code (with just enough variation to be annoying) in each of the other projects. The common code was more structural than business logic (e.g. Rakefile, config.ru, configuration, logging), and it was not written in a way that made it easy to extract into a gem for sharing.

Things we have questions about

  1. What is the “right size” for a microservice?

Soon after completing the first microservices integration project, we had an opportunity to do a second. This time, instead of making many different “event feed” services that each exposed a single type of event, we made one event service with an endpoint for each type of event. Some might argue that we were stretching the definition of “microservice”; however, there was still tight cohesion between the endpoints, as they were all exposing events for objects in the same underlying aggregate root. For us, the payoff of having fewer codebases and less code to maintain made the trade-off worth it, as the turnaround for exposing a new type of event was a matter of hours, instead of a matter of days.

[Diagram: Integration Microservices Design, Take 2]
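A rough sketch of the second design, with one service exposing an endpoint per event type (the event names are invented):

```ruby
require 'sinatra'
require 'json'

# All the endpoints live in one codebase, so adding a new event type means
# adding a route, not standing up a whole new service.
%w[escalations de_escalations reassignments].each do |event_type|
  get "/events/#{event_type}" do
    content_type 'application/hal+json'
    { _links: { self: { href: "/events/#{event_type}" } },
      events: [] }.to_json # storage lookup elided
  end
end
```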

I suspect that the “right size” is going to vary between projects, languages, companies and developers. I’m actually glad we made our services what I now consider to be “too small” in the first project, just as an experiment to work out where the line was for us. I now think of the “micro” as pertaining more to the “purpose” of a service than to its size. Perhaps “single-purpose service” would be a better term – but it just ain’t as catchy!

  • Sebastian

    It’s a nice retrospective. One can tell it was indeed a nice learning exercise. Good for you and good luck with future designs.

  • Mortier Andries

    How did you manage to aggregate your logging or set up some monitoring of your services?

    • Ditto. I’d like to hear about this too.

    • Nico Van Belle

      Check out services such as Logstash and New Relic. Best part: they can both be combined.

      • Mortier Andries

        Cloud services might not be an option. However, it seems more a log-mining tool rather than a log-routing thing, meaning maybe you would benefit more from a message log queue?

    • Beth S

      Hi Mortier, we used Splunk for log aggregation, and Nagios for monitoring. Both Splunk and Nagios configs came baked into the base AMI that we used to create each service from. The Stencil application included HTTP endpoints for Nagios to monitor health, and an optional “passive check” reporter class for services that had triggered tasks.

  • Alex

    How did you maintain data integrity between micro services?

    • Beth S

      A range of strategies. Jobs with retries to ensure that transient failures affected the syncing process as little as possible. Monitoring to let us know if a longer-term problem was preventing data being synced. Using change events as a trigger to go to the original source and retrieve the current data, rather than using the value associated with the change event. Being clear as to which system was the source of truth for each piece of data.

  • gogogarrett

    Is the project “Stencil” open source?

    • Beth S

      Hi Gogo,

      No, it isn’t, but it would not be much use if it were, because the technology choices are specific to one group within realestate.com.au. Different groups within the company have their own service templates. But to give you an idea of what is included:

      Application:
      JSON+HAL Index endpoint
      JSON+HAL Healthcheck endpoint
      HAL Browser
      Logger
      Example of how to inject configuration (eg. environment variables)
      Database connection (optional)
      Endpoint to trigger asynchronous job (optional)
      Passive check monitoring (optional)
      Code for creating a Pact (optional – for a service consumer)
      Code for verifying a Pact (optional – for a service provider)

      Deployment and configuration:
      Log aggregation config
      Log rotation config
      Performance monitoring config
      Web server config
      AMI creation scripts
      Deployment scripts

      Keep an eye on the realestate.com.au tech blog, because I believe there will be a post on Stencil coming up soon.

      • gogogarrett

        I’d love to read that article. Look forward to seeing more about Stencil.

  • Cristopher Stauffer

    Interesting post. For consumer driven contract tests, how did you work out the contracts in situations where the interesting portion of the contract was actually a downstream effect? For example, you call service X, which calls service Y. Service Y returns a “200” response, but actually sends out an email, for example. Did you have tests that ultimately check that the whole chain works, or did you test each link individually and consider that sufficient? I would like to try the DiUS contract testing library, but if most of the chain involves just “200” as a response, there may be less value there.

    • Beth S

      We did not test the whole chain. We relied on Pacts, some manual tests, and then made sure there was very good monitoring in production. For some scenarios, we used synthetic transactions with monitoring to ensure that the full end to end process was working in prod.

      I think you are probably right about consumer contracts not being particularly useful for your example of a service Y that just returns a 200. If you wanted to reduce your reliance on integration tests, another strategy is to use a “fixture” of the request, shared between the two projects, to do a manual consumer contract. For example, have a test in the consumer project that asserts that the request in the fixture can actually be made by the consumer code, and then use the same request fixture in a functional test for the provider that ensures the right email is created by that request.
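      A sketch of the consumer side of that idea (all names invented; the provider project would load the same fixture in its functional test):

      ```ruby
      require 'json'

      # Hypothetical consumer code under test.
      module EmailClient
        def self.build_request(to:)
          { 'path' => '/emails', 'method' => 'POST', 'body' => { 'to' => to } }
        end
      end

      # The fixture file is shared between the consumer and provider projects.
      EXPECTED_REQUEST = JSON.parse(File.read('shared/fixtures/send_email_request.json'))

      RSpec.describe EmailClient do
        it 'builds exactly the request described by the shared fixture' do
          expect(EmailClient.build_request(to: 'user@example.com')).to eq EXPECTED_REQUEST
        end
      end
      ```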

  • Beth S

    We’re having problems with notifications of comments not coming through, so ping me on twitter @bethesque if you post a question for me that I don’t seem to have noticed.

  • Thanks for the very informative post!
    Though what I did not really find in this article is how to _actually share_ code between projects without breaking the micro-/single-purpose-service strategy?

    • Beth S

      We tried a few techniques, each of which has its pros and cons. As mentioned, following the principle of “put things together that change together”, the second time we made one service instead of many. Another thing we tried was writing code that had extension points for customisation, and that could be copy-pasted directly between projects using a script. Another thing we could have tried, but didn’t, was creating each project by forking a base project (rather than starting from a clean Git repo), and then pulling in the changes when needed. Again, writing code that is designed to be extended by adding new files, rather than by modifying the base files, would help here.

      • The git/fork approach seems most sane to me: have a `base` branch on upstream for these shared files and set up a script to submit PRs for forks upon upstream changes.

        • Beth S

          We’re about to try this technique with a set of almost duplicate codebases, so I’ll try to remember to report back on its success (or lack thereof!).

          • Beth S

            Reporting back on the “use a master codebase which is an upstream of many child codebases” pattern for microservices that are quite similar: it has worked very well for us (we have one master and six child codebases). The reason it has been successful is that we have strictly followed the rule of “allow behaviour to be modified by adding files, not by modifying existing ones”. This has meant that we need to think carefully about how we design our master project, but it has made pulling from upstream a ridiculously straightforward (and potentially automatable) process.
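            For anyone wondering what “modify by adding files” looks like in practice, a tiny invented example: the master codebase loads whatever handler files exist, so a child codebase adds behaviour by dropping in new files rather than editing shared ones.

            ```ruby
            # Master codebase: require every handler present, including any that
            # only exist in a child codebase.
            Dir[File.join(__dir__, 'handlers', '*.rb')].sort.each { |file| require file }
            ```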

  • James

    Hi Beth,

    Very interesting article. How do you guys approach testing your microservices locally on your machine? I.e. is it possible to spin up all of your microservices locally and manually test them that way, or do you deploy everything to AWS?

    Thanks

    • Beth S

      It would be possible to spin them all up, but generally we just deployed them to AWS, as they’d usually need access to the third party systems anyway.