On Thursday 23rd of February, the Site Operations team held their first Bork Night. This was an exercise in resilience engineering: we introduced faults into our production systems in a controlled manner and then responded to them. The senior engineers designed a number of faults and challenged the rest of the team to identify and fix them. This ended up being a lot of fun, and we came away with a good set of learnings that we can apply back to our systems and processes to do better next time.
The format of the challenge was:
- Teams were assembled (we had two teams of two, and one of three).
- The senior engineers set up their faults, and introduced them into the production environment.
- A fault was assigned to each team, who then had 10 minutes to evaluate their problem. No attempts to fix were permitted at this time.
- The problems were then handed over from one team to the next, with 2 minutes given to do this.
- The next team then had 10 minutes to continue evaluating the problem, building upon what the first team to look at the problem had learned.
- There was one more phase of hand-over and evaluation. We then let all the teams try to agree with each other what each fault was about.
- We then let the teams prioritize the faults, and create new teams however they saw best to fix each problem. This started the Reaction phase. (Originally we were planning to do rotations of this React phase around each team after 10 minutes, but changed our approach on the fly.)
- Later, we had a debrief over pizza and beer.
The challenges presented were:
- Duplication of a server’s MAC address, causing it to not respond. Normally, every server on the network has a unique address so that information can be routed to it. A new virtual machine image was created with a duplicated MAC address. This confuses the network, as it can no longer route packets of information to the correct server, causing anything that depends on that server to start failing. We picked on a key database server for this one. Kudos to Gianluca for discovering the cause, which enabled a quick recovery by shutting down the offending duplicate machine.
- A failure of a web server to reboot. After its boot configuration was (deliberately) deleted, the web server (normally used for realestate.com.au) was made to shut down, and it would not restart. The machine had to be fixed through a management interface by copying the boot config from another machine. Congrats to Daniel for correctly speculating the cause of this.
- Forcing several property listing search servers to slow down by making them I/O bound. On several of the FAST query node servers, which normally power the property searches on REA and RCA, we saturated their ability to read information off the disks. This fault did not hamper us as badly as we thought it might. On one hand, it was a reassuring surprise that our systems were resilient against this kind of problem; on the other, we later realized we could introduce this sort of fault more effectively in future by first ensuring the search service did not have anything cached in memory.
- And as an extra bonus complication during the event, we deliberately crashed the Nagios monitoring service, so that the teams had to re-prioritize their incident response partway. Kudos to Patrick for figuring out the full extent of what was broken and getting Nagios up and running again.
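As an aside on the duplicated MAC address scenario above, here is a minimal sketch in Python of how such a clash might be spotted from an inventory of hosts and their MAC addresses. The inventory format, the hostnames and the function names are all assumptions made for illustration; this is only a sketch of the idea, not a description of how the fault was found on the night.

```python
from collections import defaultdict


def normalize_mac(mac: str) -> str:
    """Lower-case a MAC address and turn '-' separators into ':'."""
    return mac.strip().lower().replace("-", ":")


def find_duplicate_macs(inventory: dict[str, str]) -> dict[str, list[str]]:
    """Return {mac: [hosts...]} for every MAC claimed by more than one host."""
    hosts_by_mac: dict[str, list[str]] = defaultdict(list)
    for host, mac in inventory.items():
        hosts_by_mac[normalize_mac(mac)].append(host)
    return {
        mac: sorted(hosts)
        for mac, hosts in hosts_by_mac.items()
        if len(hosts) > 1
    }


if __name__ == "__main__":
    # Hypothetical inventory: hostnames and addresses are made up.
    inventory = {
        "db01": "00:16:3E:AA:BB:01",
        "web01": "00:16:3e:aa:bb:02",
        "vm-clone": "00-16-3e-aa-bb-01",
    }
    for mac, hosts in find_duplicate_macs(inventory).items():
        print(f"{mac} is claimed by: {', '.join(hosts)}")
```

Normalizing case and separators first matters because the same address is often written differently by different tools, and a naive string comparison would miss the clash.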
Several things worked well; some things we can do better. Our learnings included:
- Our emergency console and DRAC access to our servers is not smooth, with confusion over which passwords to use and a limit of a single user at a time.
- In future, the scenario design should try to avoid overlaps, where one scenario affects the same services as another.
- Some scenarios need better fault-description information.
- We need a venue where the monitoring dashboards can be shown, such as naglite.
- Wifi access continued to plague us.
- Outsider insight was a fantastic asset. We had developer participation, and while they might not have had technical familiarity with the siteops details, there were great insights coming from them as to where to focus the troubleshooting. The next Bork Night really needs to welcome non-siteops participation.
Finally, a big thank you to Tammy for arranging the pizza and drinks!
(Reposting Aaron Wigley's post from the realestate.com.au internal community site)