REA’s journey with Amazon Web Services (AWS) began in late 2010 when we started experimenting with using the cloud for our dev/test infrastructure. In 2013 we launched our first cloud-only production infrastructure to handle the dynamic resizing and serving of our images. Since then we have adopted an IT strategy of transitioning all systems to the cloud, and have therefore been running a hybrid cloud and data centre platform. More recently we have also embraced micro-services, which means the volume of systems that we run in the cloud has exploded. This post covers how our usage of AWS accounts and VPCs has changed and what we propose to do next.
It’s not news to anyone any more, so I’m sure everyone knows that Amazon Web Services (our major cloud infrastructure provider) suffered an outage within one of their availability zones on Sunday June 5th. AWS is split up into various geographic regions, and within each region, a number of availability zones. I’m going to assume most readers know about this, but if you don’t, check out how Amazon describes these things. On Sunday one of these availability zones suffered a “power event”, owing to Sydney’s wild weather on the weekend, bringing it to its knees. Lots of Australian-based websites had major problems.
I joined REA’s Consumer & Brand Delivery Engineering team 8 months ago. It’s been a blast, and I love working on the tech we use. We make extensive use of Docker, AWS, and Ruby to produce internal tools such as `shipper`, which other teams use to ship their containerized applications.
We host our own Docker Registry, and we maintain a set of base images, such as `ubuntu-ruby2.2` which is an image based on the official Ubuntu, with Ruby 2.2 and a few other dependencies baked in. We want the teams at REA to use those images, because we control how they are built and we include libraries and dependencies widely used in the company.
We’re big fans of continuous deployment here at REA. Merging a pull request and seeing the changes flow all the way to production in a matter of minutes is really awesome. Unfortunately, even with a large number of automated tests, this also makes it possible for an uncaught bug to flow all the way through as well.
We recently experienced this when some new cache-busting code was mistakenly committed and caused our landing page to use a non-existent CSS file. Fortunately we noticed quickly and so the user impact was minimal, but it highlighted that the tests in our deployment pipeline were not as effective as we would like them to be.
On a Friday a few weeks ago, we deployed a set of minor changes to one of our Rails apps. That evening, our servers started alerting on memory usage (> 95%). Our initial attempts to remedy this situation by reducing the Puma worker count on each EC2 instance didn’t help, and the memory usage remained uncomfortably high through the weekend. On Monday, we popped open NewRelic and had a look in the Ruby VM section. Indeed, both the Ruby heap and memory usage of each web worker process had begun a fairly sharp climb when we deployed on Friday, after being totally flat previously:
However, over the same period of time, the number of objects allocated in each request remained fairly static:
If our requests aren’t creating more objects, but there are more and more objects in memory over time, some of them must be escaping garbage collection somehow.
Rabbits: they are small, they are cute, they are fluffy… and they are terrible fighters*.
Which is why evolution has favoured the skittish amongst them.
The rabbit which was overly cautious and ran away at the first hint of danger survived and went on to produce more rabbits. But the rabbit which was more laid back and waited until it was sure it was in peril before trying to flee did not. Of course, running away does have a cost, and if you spend all your time running at the slightest sound you’ll expend a lot of energy and have no time to nibble at the grass and gain more**.
So what do rabbits have to do with Auto Scaling Groups (ASGs)?