Deploying a high-traffic website with zero downtime is a challenge. There’s a natural tradeoff between:
- Performance and cacheability.
- Getting updated versions of the application live.
The approach you use to manage your static assets plays a big role in this.
This post explains how we dealt with these challenges in our move from the data centre to a multi-region, highly available, cloud-based architecture.
What are static assets?
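When a consumer loads a web application, the server provides (among other things) the initial HTML page and the static assets that page references, such as JavaScript, CSS and images.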
The static asset content can become quite large – so there’s a need to ensure content is served from cache as much as possible. This is usually a combination of in-browser caching along with caching in a content delivery network (CDN). At realestate.com.au we use Akamai as our CDN.
A common way of doing this is:
- Append a unique hash, generated from the file contents, to each static asset filename (any change to the content means a new filename).
- Provide short cache expiry headers (or no-cache) on the initial HTML page.
- Provide a really long cache expiry header on the hash-appended static assets.
This results in the bulky static assets being heavily cached, but a new deployment quickly switching consumers over to new versions of the required static assets.
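As a rough illustration of the scheme above, here is a minimal Node.js sketch. The function names, the 10-character hash length and the one-year `max-age` are assumptions made for the example, not details of our actual build.

```javascript
const crypto = require('node:crypto');
const path = require('node:path');

// Build a content-addressed filename of the form "<name>.<hash><ext>".
// The 10-character hash length is an arbitrary choice for this sketch.
function hashedFilename(filename, contents) {
  const hash = crypto
    .createHash('sha256')
    .update(contents)
    .digest('hex')
    .slice(0, 10);
  const { name, ext } = path.parse(filename);
  return `${name}.${hash}${ext}`;
}

// Pick a Cache-Control value: long-lived for hashed assets, always
// revalidated for the HTML page. The values are illustrative only.
function cacheControlFor(isHashedAsset) {
  return isHashedAsset ? 'public, max-age=31536000, immutable' : 'no-cache';
}
```

Because the hash depends only on the file contents, an unchanged file keeps its name (and its cache entries) across releases, while any edit produces a new name that the next HTML page will reference.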
A small problem
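This approach is sound; however, it rests on a key assumption: a consumer making multiple requests always ends up hitting a server running the same version that served the HTML.
In reality, deployments are not atomic. It is perfectly possible for a consumer to receive HTML from a server running the new version and then have the follow-up asset request land on a server still running the old version, which does not have that file.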
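The failure mode can be shown with a toy model. Everything here (version labels, filenames, the `requestAsset` helper) is made up for illustration.

```javascript
// Each server only has the hashed assets produced by the build it runs.
// The version labels and filenames are invented for this illustration.
const servers = {
  v1: new Set(['app.aaa111.js']),
  v2: new Set(['app.bbb222.js']),
};

// Returns the HTTP status a server would give for an asset request.
function requestAsset(serverVersion, filename) {
  return servers[serverVersion].has(filename) ? 200 : 404;
}

// HTML served by v2 references the v2 asset...
const assetReferencedByNewHtml = 'app.bbb222.js';
// ...but mid-deployment the follow-up request may land on a v1 server.
const matched = requestAsset('v2', assetReferencedByNewHtml);
const mismatched = requestAsset('v1', assetReferencedByNewHtml);

console.log({ matched, mismatched }); // { matched: 200, mismatched: 404 }
```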
In the past, we’ve used a number of strategies to tackle the problem:
- Using a load balancer to rapidly switch between two clusters of servers.
- Using DNS changes to less rapidly switch between two clusters of servers.
- Making our applications serve mismatching assets and hoping things don’t break much.
- Closing our eyes and pretending it’s not actually a problem.
Here’s a visual representation of our approach:
A small problem gets bigger
The problem gets worse when we move to a highly available multi-region AWS deployment. We intentionally keep separate regions (Frankfurt & Sydney) completely decoupled and independent. Akamai uses latency-based routing to distribute traffic to the two regions. There’s no guarantee that all requests from a single consumer will be routed to the same region.
When we deploy a new version of the application, we upgrade both regions simultaneously with zero downtime. It is, however, impossible to ensure the switchover happens at precisely the same time. Even within one region, different versions of the application can be serving traffic.
And of course the deployment to one region could fail completely, leaving us with two versions of the application running for an extended period.
Getting this to work without re-evaluating our approach would look something like this:
There were a few key attributes we were looking for in our solution:
- Be able to deal happily with mismatched versions – ideally for an extended period of time.
- Avoid complicating our (already complex!) CDN configuration.
- Avoid having our Dev / Test environments be more complicated, or differ from production.
- Avoid relying on stateful approaches like sticky sessions (either sticky to one server or sticky to one AWS region).
- Ideally allow us to have longer cache expiry times even on the main HTML.
Publish static assets to S3
As part of the build, we publish the hash-appended versions of the static assets to an S3 bucket before deploying the application:
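A publish step along these lines might look like the sketch below. It is not our actual build script: `publishAssets` and the injected `putObject({ key, body, cacheControl })` are hypothetical names, with `putObject` standing in for whatever S3 client call the build uses.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// Upload every file in `assetDir` using the supplied `putObject` function.
// `putObject` is a hypothetical stand-in for a real S3 upload call; it is
// injected so this sketch makes no assumptions about a particular SDK.
async function publishAssets(assetDir, putObject) {
  const uploaded = [];
  const entries = fs.readdirSync(assetDir, { withFileTypes: true });
  for (const entry of entries) {
    if (!entry.isFile()) continue; // flat directory only, for brevity
    const body = fs.readFileSync(path.join(assetDir, entry.name));
    await putObject({
      key: entry.name, // the filename already contains the content hash
      body,
      cacheControl: 'public, max-age=31536000', // illustrative value
    });
    uploaded.push(entry.name);
  }
  return uploaded;
}
```

The ordering is the important part: the assets are in the bucket before any server starts handing out HTML that references them.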
Create fallback asset retrieval in web application
We use NodeJS and Express to serve our in-application static assets.
It was a relatively simple change to configure the application with an extra piece of middleware when serving static assets:
- The file is first looked for on disk.
- If it can’t be found, the application looks for the asset in the backing bucket and returns it.
- If it’s in neither location, then a 404 response is returned.
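The lookup order above could be sketched as Express-style middleware. This is a reconstruction under assumptions rather than our production code: `fetchFromBucket(key)` is a hypothetical async helper that resolves to a `Buffer` (or `null` when the object is missing), and the `X-Static-Asset-Source` header name is invented for the example.

```javascript
const path = require('node:path');

// Express-style middleware factory. Intended to be mounted *after* the
// normal on-disk static handler, so it only runs when the file was not
// found locally. `fetchFromBucket` is a hypothetical helper (see above).
function fallbackToStaticAssetsFromS3(fetchFromBucket, log = console.log) {
  return async function (req, res, next) {
    try {
      const key = req.path.replace(/^\/+/, ''); // "/app.1a2b.js" -> "app.1a2b.js"
      const body = await fetchFromBucket(key);
      if (body === null) {
        return next(); // not in the bucket either: fall through to the 404
      }
      // Header name is an assumption made for this sketch.
      res.set('X-Static-Asset-Source', 's3');
      res.type(path.extname(key)); // set Content-Type from the extension
      log(`static asset fallback: served ${key} from S3`);
      res.send(body);
    } catch (err) {
      next(err); // let the application's error handler deal with failures
    }
  };
}
```

One way to wire this up would be `app.use('/assets', express.static(dir), fallbackToStaticAssetsFromS3(fetchFromBucket))`, where `/assets` and `dir` are placeholders. `express.static` calls `next()` when a file is missing, which hands the request to the fallback.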
To allow us to see when the fallback is triggered, the ‘fallbackToStaticAssetsFromS3’ middleware also sets a header on the response and logs a message whenever a file is served from S3.
This allows us to easily look in our log aggregation system (Splunk) and see how often the fallback is triggered, without having to ship S3 logs around. (Hint: it’s used every time we deploy.)
Set retention period on the bucket
Given each release deploys a bunch of new assets to the S3 bucket, it makes sense for us to clean up after ourselves. To do this, we’ve set a retention policy on the bucket so that after a couple of months, the assets are deleted and we don’t get an ever-growing S3 bucket.
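For reference, a retention rule of this kind has roughly the following shape when expressed as an S3 lifecycle configuration. The sketch builds it as a plain JavaScript object; the 60-day figure stands in for "a couple of months" and the rule ID is hypothetical.

```javascript
// Lifecycle configuration in the shape S3 accepts for a bucket.
// 60 days is an assumption standing in for "a couple of months".
function buildRetentionPolicy(days = 60) {
  return {
    Rules: [
      {
        ID: 'expire-old-static-assets', // hypothetical rule name
        Status: 'Enabled',
        Filter: { Prefix: '' }, // empty prefix: apply to every object
        Expiration: { Days: days },
      },
    ],
  };
}

console.log(JSON.stringify(buildRetentionPolicy(), null, 2));
```

Expiration is counted from when an object was last written, so if the build re-uploads an unchanged asset on every release, its clock resets and it stays in the bucket for as long as releases keep happening.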
Add monitoring on the static asset bucket
Last but not least, we have some monitoring in place to check (via the application) that it can pull static assets from the backing S3 bucket. This protects us if someone were to delete the bucket or change permissions so the application could no longer talk to it.
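A check of this sort can be quite small. The sketch below is an assumption about its general shape, not a description of our monitoring: `fetchFromBucket` is the same hypothetical helper used by the fallback, and `canaryKey` is a made-up name for a known object the check tries to read.

```javascript
// Reports whether the application can read a known object from the bucket.
// `fetchFromBucket` and `canaryKey` are hypothetical; the real check may
// work differently.
async function checkStaticAssetBucket(fetchFromBucket, canaryKey) {
  try {
    const body = await fetchFromBucket(canaryKey);
    return body === null
      ? { healthy: false, reason: `canary object ${canaryKey} not found` }
      : { healthy: true };
  } catch (err) {
    // Covers a deleted bucket, revoked permissions, network errors, etc.
    return { healthy: false, reason: err.message };
  }
}
```

If a canary object is used like this, it would need to be excluded from the retention policy or re-published regularly, otherwise the check would start failing once the object expires.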
We now have a reliable way of serving static assets, which means we don’t need to care about how quickly our deployments go through. If a deployment to one region fails, we can fix it in our own time!
The only piece that ties our two regions together is that the Frankfurt stack uses the Sydney S3 bucket. We could easily fix this by replicating the bucket (or deploying to two separate buckets), but given this is only a problem during a deployment, we decided the extra complexity was probably not worth the effort.
The best bit is we’ve been able to do it:
- Without making our Dev/Test approach differ from production.
- Without complicating our Akamai configuration.