How We (Mostly) Survived The Stormy Apocalypse!

It’s not news to anyone any more: Amazon Web Services (our major cloud infrastructure provider) suffered an outage in one of their availability zones on Sunday, June 5th. AWS is split into various geographic regions, and within each region, a number of availability zones. I’ll assume most readers know about this, but if you don’t, check out Amazon’s own description of regions and availability zones. On Sunday, one of these availability zones suffered a “power event”, owing to Sydney’s wild weather that weekend, bringing it to its knees. Lots of Australian-based websites had major problems.

Most of REA’s critical, consumer-facing services continued to function. We had an issue with one of our front-end web apps, and one of our native mobile apps suffered from a third-party dependency being unavailable. So while we weren’t totally unaffected, it was overall a satisfying outcome.

I’ll try to stay focused on the technical details of how we build for events like this, but it’s important to understand the overall value: in terms of lost business (people being unable to visit the site), brand reputation, and other measures, there’s a big impact in our continuing to be available, and the cost of achieving that is a trade-off against the value it provides the business.

How did we manage this? And how did we deal with the problems we did have? There were a few things which helped. A number of our core systems are still in our own data centres (and were thus unaffected); we deployed some of our important systems in AWS in a multi-region setup; we have some critical stuff in S3 (which wasn’t affected by the outage); and there’s some great work our teams have done in the architecture and building of our systems, and their operational delivery, that have led to this resilience.

Services in the (Old) Data Centre

Some of our core systems are still located within our data centre. These were obviously immune to the AWS failure themselves, but many depend on AWS-hosted systems behind the scenes. We’ve built most of these to degrade gracefully when there’s a failure: whilst the services themselves weren’t affected, the systems they depended on often were, and in those cases we tended to handle the unavailability gracefully.

Designing For Failure

Running so many different interconnected systems, it’s inevitable that some will fail. We know that in this micro-service-heavy, cloud-based world, systems can’t always be relied on. So we need to build our websites and apps to remain usable in the event of failure of some of the systems they depend on, and to have their experience degrade gracefully when problems do occur.

On the Sunday in question, our ad server was unavailable. Some of our experiences slowed down as a result, and one of our apps ended up broken. It was a good lesson in what can go wrong and how we could handle things better. On our main website, consumers wouldn’t have noticed anything except the absence of ads.
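
The pattern is simple to sketch. The snippet below is illustrative only, assuming a hypothetical ad-server client (not our actual code): if the dependency fails, serve the page without ads rather than failing the whole request.

```python
# Minimal sketch of graceful degradation around a flaky dependency,
# here a hypothetical ad-server client. If the call fails, the page
# renders without ads instead of erroring out.
def fetch_ads(ad_client, timeout=0.5):
    """Return a list of ads, or an empty list if the ad server is down."""
    try:
        return ad_client.get_ads(timeout=timeout)
    except Exception:
        return []  # degrade gracefully: no ads, but a working page

class DownAdServer:
    """Stand-in for an ad server suffering a "power event"."""
    def get_ads(self, timeout):
        raise ConnectionError("ad server unavailable")

print(fetch_ads(DownAdServer()))  # prints [] rather than raising
```

The key design choice is that the fallback value (an empty ad list) is something the rest of the page can render without special-casing.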

Choosing to deploy onto S3

One of the ways we try to minimise our risk is by choosing appropriate patterns for our applications. Our homepage and certain other systems are just static assets (HTML, CSS, JavaScript, images), all executed client side. For these types of applications, we deploy straight to S3 buckets, a form of serverless architecture (which is so trendy these days!).

S3 by its nature is more durable than an EC2 instance, and more likely to survive an AZ failure. It’s multi-AZ by default, and while the events of the weekend have shown that just being multi-AZ isn’t necessarily enough to be resilient to an AZ failure, the S3 service held up well.

From that perspective a number of our services were fine during the issue by virtue of being hosted on S3.

Running a Multi-AZ Setup by Default

Almost all of our production systems are multi-AZ where possible. This helps because AZs are meant to be independent of each other, so an issue in one won’t affect resources in another. For much of Sunday’s outage, this was sufficient to keep systems running, albeit in a reduced-redundancy state.

This isn’t infallible though. We noticed a couple of ways this didn’t help. One is when services aren’t available to be run in multiple availability zones, or to fail over automatically. Redshift, for example, is single-AZ only.

If we need resilience of a service, and it’s not available in multi-AZ mode, then we’ll consider what else we need to do to provide the necessary level of availability (for example, multi-region as described below).

There were other multi-AZ issues we observed during the event. Because of Amazon’s internal problems, a number of their APIs for controlling AWS weren’t working, meaning we had reduced or no control over our resources. This meant a loss of redundancy, and when recovery events did happen, they had unexpected consequences.

Some weird things happened once the affected AZ had begun to recover, and these caused us problems. In a lot of cases, new instances were taking a long time (up to an hour!) to register with an ELB (Elastic Load Balancer, the AWS resource that balances load across multiple servers) once their ASGs (Auto Scaling Groups, the AWS resource that starts and stops EC2 instances to ensure you’ve got enough running to handle your load) had started them up and declared them healthy. In one case this resulted in a service outage.

In this case, when the faulty AZ was restored and the ASG was finally able to spin up a new instance, it actually did so in the AZ which hadn’t had any problems. Our default setup is to have our ASGs balance instances across AZs, so once this new instance was healthy, the ASG noticed it was unbalanced (two instances in one zone, and none in the other). It killed off the original (working) instance and spun up a replacement in the other AZ. Both of these instances were now healthy, but stuck “waiting to register” on the ELB, so obviously we saw errors. Our solution eventually was to kill the ELB and recreate it, but only after seeing no instances register for a very long time. Being able to easily redeploy our infrastructure at any time was really valuable here.
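
To make the rebalancing behaviour concrete, here is a toy model (pure Python, not AWS code, and a simplification of the real algorithm) of an ASG that rebalances whenever one AZ holds two or more instances than another:

```python
# Toy model of ASG AZ rebalancing: if the spread between the fullest
# and emptiest AZ reaches 2, terminate an instance in the full AZ and
# launch a replacement in the empty one.
from collections import Counter

def rebalance(instances, azs):
    """instances: list of AZ names, one entry per running instance."""
    counts = Counter({az: 0 for az in azs})
    counts.update(instances)
    fullest = max(counts, key=counts.get)
    emptiest = min(counts, key=counts.get)
    if counts[fullest] - counts[emptiest] >= 2:
        instances = list(instances)
        instances.remove(fullest)     # terminate an instance in the full AZ
        instances.append(emptiest)    # launch a replacement in the empty AZ
    return instances

# Both instances landed in 2a while 2b was recovering; one gets moved.
print(rebalance(["ap-southeast-2a", "ap-southeast-2a"],
                ["ap-southeast-2a", "ap-southeast-2b"]))
```

The trouble on the day was that the “replacement” step produced instances that then sat unregistered behind the broken ELB.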

Protecting our Systems Beyond an AZ Failure – Multi-Region

For a number of our core consumer experiences, we have a really high availability requirement. So for a number of our core and top priority systems hosted on AWS infrastructure, we’ve pursued a multi-region setup. This gives us even more tolerance in the event of failures, and is important to support the availability targets described above.

Our multi-region setup is quite neat. We use a number of techniques together to achieve good, automated redundancy across two AWS regions.

Independent Systems

Our front-end systems usually consist of a number of components. There will be a web app and/or an API. It’ll have some sort of backing store, chosen appropriately for the nature of access (S3, RDS, Elasticsearch, etc). We’ll usually feed that backing store from somewhere. We’ve embraced immutable architecture: all of these components are disposable and can be built from the ground up. This changes the way we can think about them, and in many cases if something goes wrong, we can just build them again.

Many of our front end systems are backed by a data publishing pipeline that looks like:

Data Publishing Pipeline

That is, we have a master data store where the data is maintained. We then have a feed from this: in some cases files exported to S3 buckets, in others an Atom-style HTTP feed API, or a Kinesis stream; any consistent interface to the data. Next is the “Feeder”, which reads from this consistent “Feed” interface and pumps documents into the data store used by the front end (in this example, an Elasticsearch cluster). This is fronted by an “API”, which is in turn consumed by a web app. These systems tolerate failure in many of their parts: as long as the cluster is available, the API will continue to serve, regardless of the state of the feeder or the source of the data.
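
The feeder pattern above can be sketched in miniature. This is a deliberately simplified illustration (an in-memory dict stands in for the Elasticsearch cluster, and the class names are invented for the example):

```python
# Simplified sketch of the data publishing pipeline: a Feeder reads
# documents from a consistent Feed interface and pumps them into the
# front-end data store.
class Feed:
    """Any consistent interface to the master data: S3 exports, an
    Atom-style HTTP feed, a Kinesis stream... here, an in-memory list."""
    def __init__(self, documents):
        self.documents = documents

    def read(self):
        yield from self.documents

class Feeder:
    """Pumps documents from the Feed into the front-end store."""
    def __init__(self, feed, store):
        self.feed = feed
        self.store = store

    def run(self):
        for doc in self.feed.read():
            self.store[doc["id"]] = doc

store = {}  # stands in for the Elasticsearch cluster behind the API
Feeder(Feed([{"id": 1, "suburb": "Richmond"}]), store).run()
print(store)
```

Because the API only ever reads from `store`, a dead feeder (or a dead feed) leaves the API serving slightly stale data rather than failing.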

When going multi-region, we create a copy of each of these components in each region, and build the systems in each region totally independently; in doing this, we’ve embraced eventual consistency. The only thing the regions share is the source of the data (the “Master Data Store” and “Data Feed”).

In this way, if one region has problems, the other is totally unaffected.

Ok – so there’s two of everything. Now what?

When building this infrastructure, we were looking for a way in which we could nicely direct clients to the appropriate endpoint. In many of our cases, we have clients of our APIs in both Europe and Australia. These connections are latency sensitive (auto-suggesting items in a search box for example). Because of this constraint, we chose Frankfurt as the second region for our systems. There’s another twist though—to keep redundancy for the clients we still want them to talk cross region if their local copies aren’t available.

At this point, with the two copies of our infrastructure, we end up with two CNAMEs, one for each region’s ELB. Let’s call them service-name-api.ap-southeast-2.backend.realestate.com.au (for Sydney) and service-name-api.eu-central-1.backend.realestate.com.au (for Frankfurt).

We looked into a number of options of how we could switch between these neatly, but the one that seemed to work the best for us was a combination of AWS Route53 latency routing and Route53 health checks.

AWS Route53 Latency Routing

When setting up CNAMEs in Route53, it’s possible to define a “routing policy” for when there are multiple records of the same name. The options are “Weighted” (requests are distributed in specified proportions), “Failover” (active-passive failover), “Geolocation” (based on the geographic location of the user) and “Latency” (Route53 uses its latency measurements to return, for each user, the record in the region with the lowest latency). We chose latency-based routing, as this resolves our endpoints to the best option for each consumer.

AWS Route53 Health Checks

In addition to latency-based routing, we added a health check to each CNAME, which checks a health-check endpoint on each API. If a region’s service stops responding, its record stops being returned in DNS resolution.
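
Conceptually, the combined behaviour looks like this toy resolver (a model of the idea, not of Route53 itself; the record values follow the naming used above):

```python
# Toy model of latency routing plus health checks: among records of the
# same name, drop unhealthy ones, then pick the lowest-latency record
# for the client. If nothing is healthy, all records stay in play
# (Route53 "fails open" in that situation).
def resolve(records):
    """records: list of dicts with 'target', 'latency_ms', 'healthy'."""
    candidates = [r for r in records if r["healthy"]]
    if not candidates:
        candidates = records
    return min(candidates, key=lambda r: r["latency_ms"])["target"]

records = [
    {"target": "service-name-api.ap-southeast-2.backend.realestate.com.au",
     "latency_ms": 20, "healthy": False},   # Sydney failing its check
    {"target": "service-name-api.eu-central-1.backend.realestate.com.au",
     "latency_ms": 280, "healthy": True},
]
print(resolve(records))  # an Australian client falls back to Frankfurt
```

With both regions healthy, the same client would get the Sydney record, since its latency is lower.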

With all of this in mind, we realised we needed to create a new CNAME for our API, and have Route53 use these properties to resolve the correct region based on where the client is.

To configure all of this using CloudFormation, we use the following snippet (actually, it’s a little edited for brevity):

{
  "AWSTemplateFormatVersion": "2010-09-09",
  "Description": "API Health check",
  "Parameters": {
    "HostedZoneName": {
      "Type": "String",
      "Description": "Zone name",
      "Default": "backend.realestate.com.au"
    },
    "CNAME": {
      "Type": "String",
      "Description": "canonical name"
    },
    "DeploymentRegion": {
      "Type": "String",
      "Description": "Deployment region"
    }
  },
  "Resources": {
    "HttpHealthCheck": {
      "Type": "AWS::Route53::HealthCheck",
      "Properties": {
        "HealthCheckConfig": {
          "Port": "443",
          "Type": "HTTPS",
          "ResourcePath": "/healthcheck",
          "FullyQualifiedDomainName": {
            "Fn::Join": [".", [
              { "Ref": "CNAME" },
              { "Ref": "DeploymentRegion" },
              { "Ref": "HostedZoneName" }
            ]]
          },
          "RequestInterval": 10,
          "FailureThreshold": 2
        },
        "HealthCheckTags": [
          {
            "Key": "Name",
            "Value": {"Ref": "AWS::StackName"}
          }
        ]
      }
    },
    "DNSRecord": {
      "Type": "AWS::Route53::RecordSet",
      "Properties": {
        "HostedZoneName": {
          "Fn::Join": [".", [
            { "Ref": "HostedZoneName" },
            ""
          ]]
        },
        "Name": {
          "Fn::Join": [".", [
            { "Ref": "CNAME" },
            { "Ref": "HostedZoneName" }
          ]]
        },
        "Type": "CNAME",
        "TTL": "10",
        "ResourceRecords": [{
          "Fn::Join": [".", [
            { "Ref": "CNAME" },
            { "Ref": "DeploymentRegion" },
            { "Ref": "HostedZoneName" }
          ]]
        }],
        "HealthCheckId": { "Ref": "HttpHealthCheck" },
        "Region": { "Ref": "DeploymentRegion" },
        "SetIdentifier": {
          "Fn::Join": [".", [
            { "Ref": "CNAME" },
            { "Ref": "DeploymentRegion" },
            { "Ref": "HostedZoneName" }
          ]]
        }
      }
    }
  }
}

and create one of those for each region (in our case – ap-southeast-2 and eu-central-1). For our example above, the parameters we use are:

  • CNAME: service-name-api
  • HostedZoneName: backend.realestate.com.au
  • DeploymentRegion: either ap-southeast-2 or eu-central-1

The key parts here are:

  • We’re creating a CNAME for “service-name-api.backend.realestate.com.au” for each region.
  • We’ve defined a health check pointing at /healthcheck on each region’s API.
  • The TTL is 10 seconds.
  • By specifying the “Region” parameter, we’re implying that we want latency based routing.

When you view this in the console, you’ll see:

Route53 Latency Based Routing

And the health check:

Route53 Health Check

So it just flips over?

Mostly. We also had some other considerations here. How soon would we want to flip over if we had a problem? We decided on “as soon as possible”, so we chose a TTL on the CNAMEs of 10 seconds, meaning any resolution is only cached for 10 seconds. This potentially puts a lot of load on the DNS servers, however, and since we’re using AWS for this, we need to consider the cost. At $0.60 per million requests, our expected query volume made it a sensible trade-off.
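
The arithmetic is easy to check. The client count below is purely hypothetical, for illustration; only the $0.60-per-million price comes from the discussion above:

```python
# Rough Route53 query-cost estimate for a 10-second TTL, assuming a
# hypothetical 2,000 clients that each re-resolve once per TTL window.
PRICE_PER_MILLION = 0.60   # USD per million queries (figure from the post)
clients = 2000             # assumed client count, illustrative only
ttl_seconds = 10
seconds_per_month = 30 * 24 * 3600

queries_per_month = clients * seconds_per_month / ttl_seconds
cost = queries_per_month / 1_000_000 * PRICE_PER_MILLION
print(f"{queries_per_month:,.0f} queries/month costs ${cost:.2f}")
```

Even at hundreds of millions of queries a month, the bill stays in the hundreds of dollars, which is cheap insurance for sub-minute failover.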

One further complication was with the JVM (Java Virtual Machine). By default, the JVM caches DNS resolutions forever. It took a lot of experimenting to work out the fix: a combination of the correct JVM parameters (we needed to set networkaddress.cache.ttl to something low) and correctly configuring our HTTP connection pooling.
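
To see why this matters, here is a Python sketch of the failure mode (a model of the caching behaviour, not of the JVM itself): a cache that ignores TTLs serves a stale answer forever, while a TTL-bounded cache re-resolves once the entry expires, which is the effect that setting networkaddress.cache.ttl gives you.

```python
# Sketch of a TTL-bounded resolver cache. With no TTL (the JVM's
# default for DNS), the first answer would be served forever and a
# failover to another region would never be noticed.
import time

class TTLCache:
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.entries = {}  # name -> (value, stored_at)

    def get(self, name, resolver, now=None):
        now = time.monotonic() if now is None else now
        hit = self.entries.get(name)
        if hit is not None and now - hit[1] < self.ttl:
            return hit[0]              # fresh enough: serve cached answer
        value = resolver(name)         # expired or missing: re-resolve
        self.entries[name] = (value, now)
        return value

cache = TTLCache(ttl_seconds=10)
cache.get("api", lambda n: "1.1.1.1", now=0)
print(cache.get("api", lambda n: "2.2.2.2", now=5))   # within TTL: cached
print(cache.get("api", lambda n: "2.2.2.2", now=15))  # expired: fresh answer
```

Connection pooling has the same character: a pool that never retires connections keeps talking to the dead region, so idle timeouts need tuning alongside the DNS TTL.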

Putting it all together

We’ve now got a CNAME which will resolve to the closest (defined by latency) endpoint, based on availability. We can then either hit these endpoints directly, or stick something like CloudFront or Akamai in front for further caching options.

An interesting gotcha is around the latency based routing: AWS doesn’t always know where the Akamai edge servers (or any edge servers for that matter) are located, so might choose the wrong destination for you. AWS uses the location of your DNS resolver to choose what’s closest, so if you’re using something like Google’s public resolver (8.8.8.8), you can end up with interesting outcomes as well.

Test Test Test

As a result of the outage, many people have talked about using Netflix’s Chaos Monkey. We’ve not gone down that path. What we have done, however, is run disaster recovery exercises against our crucial multi-region systems. Before deploying, we make sure we can take down an entire region, that the system continues to work, and that we can rebuild the lost region without too much of a problem. This gives us confidence around our patterns.

As we establish these patterns and learn more, we’re able to iterate on them and improve our deployments.

Did it work?

In short, yes. One of our services automatically flipped over to our European region when some of its instances had problems.

Queries to the ap-southeast-2 API:

ap-southeast-2 traffic

Queries to the eu-central-1 API:

eu-central-1 traffic

Wow, that sounds manageable, so why not just do it for everything?

Unfortunately, doing something like this doesn’t come for free. You’re looking at doubling your infrastructure costs by going multi-region. It takes well architected systems to function under “eventual consistency”, and to be decoupled in a way that allows redundancy in appropriate parts of the infrastructure. Making your infrastructure immutable comes at some automation cost. We aim for all of these anyhow, but some systems are in a more mature state than others.

Where data is writable, life gets much harder. Most of our multi-region systems have data travelling in one direction only, to avoid the inevitable consistency/availability trade-offs.

And in some cases, it’s just not worth it. Either the SLAs don’t indicate a need for multi region, or the system isn’t critical enough to justify the engineering or infrastructure expense.

The People Factor

None of the above would be possible without teams of developers and operational staff who care about this stuff, and a great “dev ops” culture. Within our teams, developers collaborate closely with operations staff to understand how our systems work and need to be deployed to withstand various failures. There is much more to building a piece of software than just “writing the code”—it is the culmination of all aspects of running successful products.

And more than that, we have great teams that self-organise when an incident occurs. The uncertainty around how long this incident would last worried us. Within minutes of AWS experiencing issues, a number of people had convened on a Slack channel discussing the issue, and over time more arrived. People just wanted to help, and with no egos and amazing self-organisation, people took on roles that helped patch the various holes and problems that appeared. We had people on the ground (so to speak) helping each other out, even outside their usual areas of expertise. At no stage did it feel like a panic, and people were able to remain calm and focussed throughout. A number of engineering managers and C-level execs popped into the channel to watch, but were happy to keep quiet and out of the way.

I feel it was a real testament to our DevOps culture at REA, and our pride and commitment to the systems we build and operate, that enabled us to work through and resolve the issues we had as a result of this incident.

Overall

While we didn’t get away totally unscathed, we avoided major impact to the large majority of our consumers through a combination of dedicated, talented people, well-architected and engineered solutions (including our multi-region setup), appropriate choices of technology, and a bit of luck from our continued use of a data centre. As we continue moving our core systems from the data centre to the cloud, this event has shown us that whilst we still have things to learn, our underlying principles for handling these sorts of failures check out, and we’re confident we’d be OK in the future.