Scaling On-Call: from 10 Ops to 100 Devs

This is an Ignite talk I gave at DevOpsDays Sydney 2016.


Like a lot of on-call systems, ours was once pretty terrible. We had 10 operations staff on-call, and we were responsible for every system in the company. Spending a week on-call usually meant a week without sleep.

A sleep-deprived first responder waking up late at night to a buzzing phone and a laptop screen burning bright with Nagios alerts

Eventually one of the on-call engineers got fed up with this, so they removed every single check from the after-hours notification period. They then went through and selected only the most critical ones to put back in.

person separating alerts into a giant stack to bin and a very small stack to keep
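
For a concrete picture of what that pruning meant in Nagios terms, here is a minimal, hypothetical sketch (the host, check, and timeperiod names are made up): most checks are moved to a business-hours notification period, and only the genuinely critical ones keep paging around the clock.

    # Hypothetical Nagios object config: almost everything notifies during
    # business hours only; a handful of truly critical checks stay on 24x7.

    define timeperiod {
        timeperiod_name  business-hours
        alias            Business Hours
        monday           09:00-17:00
        tuesday          09:00-17:00
        wednesday        09:00-17:00
        thursday         09:00-17:00
        friday           09:00-17:00
    }

    define service {
        use                  generic-service
        host_name            web01               ; hypothetical host
        service_description  Disk Space
        check_command        check_disk
        notification_period  business-hours      ; stops paging after hours
    }

    define service {
        use                  generic-service
        host_name            web01
        service_description  Website Responding
        check_command        check_http
        notification_period  24x7                ; one of the few kept overnight
    }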

This reduced the amount of noise massively, but it still meant that when an alert came through, chances were it was for a system you had never seen before. So you’d end up spending more time looking for the documentation than actually fixing the problem.

person looking at a complicated machine scratching their head wondering how it works

It wasn’t long before another engineer became fed up with this, so they decided to take the handful of systems they had built and create their own after-hours schedule. Shortly afterwards, the other engineers and teams started following this pattern.

three people struggling to carry a very large box full of servers while another person carries a very small box

Then the other two remaining operations staff and I realised we’d either need to look after everything that was left ourselves, or ask our developers for help. Fortunately, our devs already had experience being on-call during the day.

one person offering to help another who just received an alert during the day while they are at work

So it wasn’t too difficult to convince a few of them to try going on-call after hours. Once a few had survived the process, others started to volunteer as well.

a developer asking to go on-call so they can learn more operations things

They were generally pretty excited about looking after the things they had built after-hours, but were quite nervous about looking after systems that others had built, systems they didn’t have much to do with day to day.

a person looking at two complicated machines scratching their head saying they know how one works but are unsure of the other

So we tried our best to convince them that we didn’t expect them to know everything, and we told them a lot of incident war stories so they could understand that half the time we in ops have no idea what we’re doing either.

a person telling another person a story about a time the servers caught fire and they caught fire too

We also started running on-call training sessions. These let people know what was expected of them and what they could expect, taught them how to use our monitoring systems, and made sure they had access to all the systems they needed.

an operations person running a training session on ops-fu for some developers who are about to go on-call

We spent a lot of time improving our documentation too, putting it into a consistent format, and creating a central catalogue so it was easy to find. The time we have spent doing this has paid for itself so many times over.

a person making the awesome decision to read some well written documentation instead of calling and waking up their escalation

We also made sure they were empowered. We gave them root access to the systems they needed, and let them know that if an alert came through that wasn’t actionable or urgent, they were allowed to adjust or even delete it without asking ops for permission.

an empowered person dragging a bad alert into the recycle bin because it does not add any value

We started running formal handover meetings where we’d look at the alerts that had fired during the past week and make sure there were git issues for them. We’d also make sure those issues were prioritised against our regular work.

people attending an on-call handover session discussing what broke in the previous week and creating git issues for those things

This often meant we had to involve our product managers in these discussions, and since we started doing this our PMs have gained a much greater understanding of the costs of supporting a system 24×7.

a product manager looking sad because they realise how expensive supporting an application 24x7 is both financially and the human costs

This enabled us to re-evaluate some of our SLAs. It has also helped product understand why we spend time on maintenance rather than just focussing on writing shiny new features.

a product manager telling some on-call staff that they have decided an application only needs to be supported during business hours and the on-call staff cheering

We also recognise that being on-call does have an impact on people’s lives, and that some people feel it more than others, which is why we prefer an opt-in or volunteer-based system. We try to reduce the impact by encouraging people to come in late if it was a noisy night, or to take a day off in lieu if they covered a public holiday.

people at work seeing that there were a lot of alerts the previous night while the on-call person is sleeping soundly at home

Of course, getting paid is an important part of any on-call system. It’s the company’s way of saying thank you and that they value your time. We feel it’s important to get paid not only for holding the pager, but also for any overtime you work during the shift.

a manager with an awesome tie handing a stack of cash to an on-call engineer and thanking them for keeping the website up

We recognise that a blameless post-incident review process is critical to a healthy on-call process. It ensures things get fixed, and it takes a lot of the stress out of incident response. It’s reassuring to know that everyone will understand that you made the best decision you could with the information you had at the time.

a person running a post incident review highlighting the root causes of an incident and making a list of actions to ensure the problem does not happen again

And like everything at REA, our on-call process is a big experiment. We’re frequently trying new things, whether it’s rostering systems or ways of alerting, and because it’s important to learn from these experiments we run regular on-call retros so we can continuously improve.

people in a retrospective deciding to try something new, there are lots more cards in the happy column than in the sad or confused columns

One thing we definitely need to improve on is our cross-team coordination, which has become much more difficult since splitting into multiple on-call pools. We also need to get better at hands-on incident response training, since we’re getting a lot less practice from real incidents.

different people getting different alerts on their phone and not realising that they are all related because the real problem is the network not the applications

All of which goes to show that putting devs on the pager has been a great success. Our systems are quieter, more resilient to failures, and overall much healthier. And so are we, because we’re all getting much more sleep.

the same first responder from the first image, this time sleeping soundly with their laptop closed because the on-call system is now way better than it was and they are getting far fewer alerts


Crazy machine designs from the REA T-shirts, designed by the talented MrDougal