This is an Ignite I gave at DevOpsDays Sydney 2016.
Like a lot of on-call systems ours was once pretty terrible. We had 10 operations staff on-call, and we were responsible for every system in the company. Spending a week on-call usually meant a week without sleep.
Eventually one of the on-call engineers got fed up with this, so they removed every single check from the after-hours notification period. They then went through and selected only the most critical ones to put back in. Continue reading
We’re big fans of continuous deployment here at REA. Merging a pull request and seeing the changes flow all the way to production in a matter of minutes is really awesome. Unfortunately, even with a large number of automated tests, this also makes it possible for an uncaught bug to flow all the way through as well.
We recently experienced this when some new cache-busting code was mistakenly committed and caused our landing page to use a non-existent CSS file. Fortunately we noticed quickly and so the user impact was minimal, but it highlighted that the tests in our deployment pipeline were not as effective as we would like them to be. Continue reading
Rabbits, they are small, the are cute, they are fluffy…. and they are terrible fighters*.
Which is why evolution has favoured the skittish amongst them.
The rabbit which was overly cautious and ran away and the first hint of danger survived and went on to produce more rabbits. But the rabbit which was more laid back and waited until it was sure it was in peril before trying to flee did not. Of course running away does have a cost, and if you spend all your time running at the slightest sound you’ll expend a lot of energy and have no time to nibble at the grass and gain more**.
So what do rabbits have to do with Auto Scaling Groups (ASGs)?