Like a lot of on-call systems, ours was once pretty terrible. We had 10 operations staff on-call, and we were responsible for every system in the company. Spending a week on-call usually meant a week without sleep.
Eventually one of the on-call engineers got fed up with this, so they removed every single check from the after-hours notification period. They then went through and selected only the most critical ones to put back in.
This reduced the amount of noise massively, but it still meant that when an alert came through, chances were, it was for a system you had never seen before. So you’d end up spending more time looking for the documentation than actually fixing it.
It wasn’t long before another engineer became fed up with this, so they decided to take the handful of systems which they had built and create their own after-hours schedule. Shortly afterwards the other engineers and teams started following this pattern.
Then the other two remaining operations staff and I realised we’d either need to look after everything that was left by ourselves, or we could ask our developers for help. Fortunately our devs already had experience being on-call during the day.
So it wasn’t too difficult to convince a few of them to try going on-call after hours. Once a few of them had survived this process then others started to volunteer as well.
They were generally pretty excited about looking after the things that they had built after-hours, but were quite nervous about looking after systems that others had built, systems which they didn’t have much to do with day to day.
So we tried our best to convince them that we didn’t expect them to know everything, and we told them a lot of incident war stories so that they could understand that half the time Ops have no idea what we’re doing either.
We also started running on-call training sessions. These let people know what was expected of them, what they could expect, and how to use our monitoring systems, and made sure they had access to all the systems they needed.
We spent a lot of time improving our documentation too, putting it into a consistent format, and creating a central catalogue so it was easy to find. The time we have spent doing this has paid for itself so many times over.
We also made sure they were empowered. We gave them root access to the systems they needed, and let them know that if an alert came through that wasn’t actionable or urgent, they were allowed to adjust or even delete it without needing permission from Ops.
We started running formal handover meetings where we would look at the alerts that happened during the past week, and we’d make sure there were git issues for these. We’d also make sure that these issues were prioritised against our regular work.
This often meant we had to involve our product managers in these discussions, and since we started doing this our PMs have gained a much greater understanding of the costs of supporting a system 24×7.
This has enabled us to re-evaluate some of our SLAs. It has also helped product understand why we spend time on maintenance rather than just focussing on writing shiny new features.
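As a rough illustration of the weekly handover review, here is a minimal Python sketch of how the past week's alerts might be tallied so that anything without a tracked issue stands out. The `Alert` record, its fields, and both helper functions are hypothetical names invented for this example; it is not the tooling we actually use.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional, Tuple


@dataclass
class Alert:
    """Hypothetical record of one after-hours alert."""
    check: str                   # name of the check that fired (assumed field)
    fired_at: datetime           # when the alert fired
    issue: Optional[str] = None  # issue reference, or None if nothing is tracked yet


def handover_summary(alerts: List[Alert]) -> List[Tuple[str, int, bool]]:
    """Return (check, alert_count, has_issue) rows, noisiest check first."""
    counts = Counter(a.check for a in alerts)
    tracked = {a.check for a in alerts if a.issue}
    return [(check, n, check in tracked) for check, n in counts.most_common()]


def missing_issues(alerts: List[Alert]) -> List[str]:
    """List the checks that fired this week but have no issue raised."""
    return [check for check, _, has_issue in handover_summary(alerts) if not has_issue]
```

In a handover meeting, the output of `missing_issues` would be the list of alerts that still need an issue raised, and the counts give product managers a concrete view of which systems are costing the most sleep.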
We also recognise that being on-call does have an impact on people’s lives, and that some people feel it more than others, which is why we prefer an opt-in or volunteer-based system. We also try to reduce the impact by encouraging people to come in late if it was a noisy night, or to take a day off in lieu if they covered a public holiday.
Of course getting paid is an important part of any on-call system. It’s the company’s way of saying thank you, and that they value your time. We feel it’s important to get paid not only for holding the pager, but also for any overtime that you work during the shift.
We recognise that a blameless post-incident review process is critical to a healthy on-call process. It ensures things get fixed, and it takes a lot of the stress out of incident response. It’s reassuring to know that everyone will understand that you made the best decision you could with the information you had at the time.
And like everything at REA, our on-call process is a big experiment, so we’re frequently trying new things, whether it’s rostering systems or ways of alerting. It’s important to learn from these experiments, so we run regular on-call retros to help us continuously improve.
One thing we definitely need to improve on is our cross-team coordination, which has become much more difficult since splitting into multiple on-call pools. We also need to get better at our hands-on incident response training, since we’re getting a lot less practice from real incidents.
Which goes to show that putting devs on pager has been a great success. Our systems are quieter, more resilient to failures, and overall much healthier, and so are we, because we’re all getting much more sleep.
Crazy machine designs from the REA T-Shirts designed by the talented MrDougal