What your AutoScaling Groups can learn from rabbits

Rabbits, they are small, they are cute, they are fluffy… and they are terrible fighters*.


Which is why evolution has favoured the skittish amongst them.

The rabbit which was overly cautious and ran away at the first hint of danger survived and went on to produce more rabbits. But the rabbit which was more laid back and waited until it was sure it was in peril before trying to flee did not. Of course, running away does have a cost, and if you spend all your time running at the slightest sound you’ll expend a lot of energy and have no time to nibble at the grass and gain more**.

So what do rabbits have to do with Auto Scaling Groups (ASGs)?
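
The rest of the post isn’t reproduced here, but to make the analogy concrete: one way to build a “skittish” group (purely an illustrative sketch, not the post’s own example) is to pair an aggressive scale-out policy with a slow, conservative scale-in policy. Everything below – the group name, the thresholds and the boto3 usage – is a placeholder assumption.

```python
# Illustrative sketch only: a "skittish" Auto Scaling Group that scales out at the
# first hint of load but scales in slowly. Group name and thresholds are placeholders.
import boto3

autoscaling = boto3.client("autoscaling")
cloudwatch = boto3.client("cloudwatch")

ASG_NAME = "web-asg"  # hypothetical Auto Scaling Group

# "Run at the first hint of danger": add two instances as soon as CPU creeps up.
scale_out = autoscaling.put_scaling_policy(
    AutoScalingGroupName=ASG_NAME,
    PolicyName="scale-out-early",
    AdjustmentType="ChangeInCapacity",
    ScalingAdjustment=2,
    Cooldown=60,
)

# "Don't burn energy needlessly": remove one instance at a time, after a long cooldown.
scale_in = autoscaling.put_scaling_policy(
    AutoScalingGroupName=ASG_NAME,
    PolicyName="scale-in-slowly",
    AdjustmentType="ChangeInCapacity",
    ScalingAdjustment=-1,
    Cooldown=600,
)

# Scale out after a single minute above a modest 50% CPU...
cloudwatch.put_metric_alarm(
    AlarmName="web-asg-cpu-high",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "AutoScalingGroupName", "Value": ASG_NAME}],
    Statistic="Average",
    Period=60,
    EvaluationPeriods=1,
    Threshold=50.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[scale_out["PolicyARN"]],
)

# ...but only scale in once CPU has stayed below 20% for 15 minutes.
cloudwatch.put_metric_alarm(
    AlarmName="web-asg-cpu-low",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "AutoScalingGroupName", "Value": ASG_NAME}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=3,
    Threshold=20.0,
    ComparisonOperator="LessThanThreshold",
    AlarmActions=[scale_in["PolicyARN"]],
)
```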


Bork Night – A Series of Successful Failures!

On Thursday 23rd of February, the Site Operations team held their first Bork Night.  This was an exercise in resilience engineering by introducing faults into our production systems in a controlled manner, and responding to them. The senior engineers designed a number of faults, and challenged the rest of the team to identify and fix them.  This ended up being a lot of fun, and we came away with a good set of learnings which we can apply back into our systems and processes to do better next time.

Bork Night - The team at work fixing failures

The format of the challenge was:

  1. Teams were assembled (we had two teams of two, and one of three).
  2. The senior engineers set up their faults, and introduced them into the production environment.
  3. A fault was assigned to each team, who then had 10 minutes to Evaluate their problem. No attempts to fix it were permitted at this time.
  4. The problems were then handed over from one team to the next; two minutes were given to do this.
  5. The next team then had 10 minutes to continue evaluating the problem, building upon what the first team to look at the problem had learned.
  6. There was one more phase of hand-over and evaluation. We then let all the teams try to agree with each other on what each fault was about.
  7. We then let the teams prioritize the faults and form new teams however they saw fit to fix each problem. This started the Reaction phase. (Originally we were planning to rotate the Reaction phase around each team every 10 minutes, but we changed our approach on the fly.)
  8. Later, we had a debrief over pizza and beer.

Trent running the bork night and setting up the failures

The challenges presented were:

  • Duplication of a server’s MAC address, causing it to stop responding. Normally, every server on the network has a unique MAC address so that traffic can be delivered to it. A new virtual machine image was created with a duplicated MAC address, which confuses the network as it can no longer deliver packets to the correct server, causing anything that depends on that server to start failing. We picked on a key database server for this one. Kudos to Gianluca for discovering the cause, enabling a quick recovery by shutting down the offending duplicate machine. (A rough sketch of how a duplicate MAC might be spotted follows this list.)
  • A failure of a web server to reboot. After its boot configuration was deleted, the web server (normally used for realestate.com.au) was made to shut down. Because the boot information had been (deliberately) deleted, it would not restart. The machine had to be fixed through a management interface by copying the boot config across from another machine. Congrats to Daniel for correctly guessing the cause.
  • Forcing several property listing search servers to slow down by making them I/O bound. This fault did not hamper us as badly as we thought it might. On several of the FAST query node servers, which normally power the property searches on REA and RCA, we saturated their ability to read information from disk. On one hand it was a reassuring surprise that our systems were resilient against this kind of problem; on the other, we realized we could introduce this sort of fault more effectively in future by first ensuring the search service had nothing cached in memory.
  • And as an extra bonus complication during the event, we deliberately crashed the Nagios monitoring service, so that the teams had to re-prioritize their incident response partway through. Kudos to Patrick for figuring out the full extent of what was broken and getting Nagios up and running again.
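
As a rough illustration of the kind of detective work the first fault required (not the actual commands used on the night), a duplicated MAC can sometimes be spotted from another host with an ARP sweep: the same hardware address answering for more than one IP is a strong hint. The subnet, and the use of scapy with root privileges, are assumptions.

```python
# Rough sketch: sweep the subnet with ARP requests and flag any MAC address
# that answers for more than one IP -- a common symptom of a duplicated MAC.
# (A multi-homed host can also trigger this, so treat hits as leads, not proof.)
# Assumes scapy is installed and the script is run with root privileges.
from collections import defaultdict
from scapy.all import ARP, Ether, srp

def find_suspect_macs(cidr="10.0.0.0/24", timeout=2):
    answered, _ = srp(
        Ether(dst="ff:ff:ff:ff:ff:ff") / ARP(pdst=cidr),
        timeout=timeout,
        verbose=False,
    )
    ips_by_mac = defaultdict(set)
    for _, reply in answered:
        ips_by_mac[reply.hwsrc].add(reply.psrc)
    return {mac: ips for mac, ips in ips_by_mac.items() if len(ips) > 1}

if __name__ == "__main__":
    for mac, ips in sorted(find_suspect_macs().items()):
        print(f"{mac} answered for: {', '.join(sorted(ips))}")
```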

Working through the failures on Bork Night

Several things worked well, and some things we can do better. Our learnings included:

  • Our emergency console and DRAC access to our servers is not smooth, with confusion over which passwords to use and the limitation of only one user at a time.
  • In future, scenario design should avoid overlaps where multiple scenarios affect the same services.
  • Some scenarios need better fault-description information.
  • We need a venue where monitoring dashboards, such as naglite, can be displayed.
  • Wifi access continued to plague us.
  • Outsider insight was a fantastic asset. We had developer participation, and while they might not have had technical familiarity with the siteops details, they offered great insights as to where to focus the troubleshooting. The next Bork Night really needs to welcome non-siteops participation.

Finally, a big thank you to Tammy for arranging the pizza and drinks!

(Reposting Aaron Wigley’s post from the realestate.com.au internal community site)

Kanban in Operations – Virtual Card Wall

Three months ago I joined the Site Operations team at realestate.com.au and I was pleased to see that the team were using a card wall for work.

Card Wall

Although the physical card wall proved to be a great place to have stand ups and manage work, it had its problems:

  • We have a distributed team. With operations teams in Italy (casa.it) and Luxembourg (athome.lu), people on devops rotations, and people occasionally working from home, it is hard for everyone to participate in stand up.
  • Data associated with cards, such as creation timestamps and creators, is dependent on people writing it on the cards.
  • Limited external visibility into the Site Ops workload. If anyone wanted to know what we were currently working on, they would have to head up to the Site Ops area and have a look.

After a discussion with the team, we decided to trial a virtual card wall.

Scope

The trial would run for two weeks, replicating the cards on our physical card wall, with a retrospective and decision to continue at the end.

The trial would not include capturing incidents or deployments, and would be as light as possible.

Setup

To get the trial up and running as soon as possible, we utilised our existing Jira installation with Greenhopper. The project setup and configuration was kept to a bare minimum.

We created five new issue types, based on the cards on our physical wall – Service Requests, Deployment, Provisioning, Housekeeping and Faults.

Card Types

A week before the trial commenced, we manually imported the cards into Jira and wrote the Jira issue number on each card. During that week we also duplicated any new physical cards into Jira. This allowed us to start tracking behaviour before we started the trial.
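
The import was done by hand, but purely as an illustration of what a scripted equivalent could look like, the sketch below creates one Jira issue per card through Jira’s standard REST API. The base URL, credentials, project key and card data are all placeholder assumptions.

```python
# Illustrative only: a scripted version of the manual card import, pushing one
# Jira issue per physical card via the standard REST API. The base URL,
# project key ("OPS"), credentials and card data are all placeholder assumptions.
import requests

JIRA_URL = "https://jira.example.com"   # placeholder
AUTH = ("ops-bot", "secret")            # placeholder credentials
PROJECT_KEY = "OPS"                     # placeholder project key

cards = [
    {"summary": "Provision new FAST query node", "issuetype": "Provisioning"},
    {"summary": "Rotate DRAC passwords", "issuetype": "Housekeeping"},
]

for card in cards:
    payload = {
        "fields": {
            "project": {"key": PROJECT_KEY},
            "summary": card["summary"],
            "issuetype": {"name": card["issuetype"]},  # one of the five new types
        }
    }
    resp = requests.post(
        f"{JIRA_URL}/rest/api/2/issue", json=payload, auth=AUTH, timeout=10
    )
    resp.raise_for_status()
    # The returned key (e.g. OPS-123) is what we'd write on the physical card.
    print(card["summary"], "->", resp.json()["key"])
```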

Card

Our virtual card wall is still tactile: stand ups are now conducted in front of a Smart Board, which lets us interact with Greenhopper using our fingers as the mouse.

The Trial

The trial kicked off on Friday 8th July at 0900. We had our regular stand up, with the only difference being the new virtual card wall.

Stand up

In addition to Greenhopper, we started trialling weekly iterations (versions) in Jira, running Thursday to Thursday.

Although we weren’t planning the iterations, the option is there to place a card a few iterations ahead if it won’t be actioned for a few weeks.

What works and what doesn’t?

The Greenhopper trial has been great, and it has identified a few things that work well and some that don’t:

  • It’s difficult to raise new cards at stand up. This is a change to our regular process, as we now have to create and edit cards before or after the meeting. However, it has minimised interruptions, allowing the team to focus on the stand up itself.
  • We are able to raise cards wherever we have access to a web browser and we are not constrained to being in the office.
  • For a few of the stand ups we didn’t have access to the Smart Board and used a projector instead. It felt awkward. Having physical interaction with the card wall definitely enhances the experience. It feels natural for the team to huddle around the card wall, rather than a computer.

What’s next?

So what’s next for the Site Operations Greenhopper integration?

  • First up is to extend the trial to the global operations teams, with a possible change to our stand up time to a more sensible hour for our European colleagues.
  • Next is to increase transparency into Site Operations’ current workload. To achieve this we will look into publishing a read-only card wall to the wider company.
  • Start planning work for iterations. We didn’t plan beyond one week during the trial, but we are collecting data on how long cards take to cycle through our system.
  • Estimate card sizes again. Based on the data collected, we should be able to reliably estimate work and compare those estimates to the actual durations.
  • Customise Jira to suit the workflow in Site Operations, including incident management and deployments. This will be an evolutionary process, with the aim of keeping the workflow as light as possible.
  • The final goal is to investigate integration with other operations systems, such as ZenDesk and Nagios. This would minimise duplicated effort and streamline our workflow. (A sketch of what a Nagios hook might look like follows this list.)
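
None of this integration exists yet; as a sketch of the sort of glue it might involve, a Nagios event handler could raise a Fault card automatically when a service check reaches a hard CRITICAL state, so it appears on the virtual wall. The handler arguments, URLs, credentials and project key below are all assumptions.

```python
#!/usr/bin/env python
# Hypothetical Nagios event handler: when a service check reaches a HARD
# CRITICAL state, raise a Faults card in Jira. Nagios would pass the state
# details as arguments in the command definition, e.g.:
#   $HOSTNAME$ $SERVICEDESC$ $SERVICESTATE$ $SERVICESTATETYPE$
# All URLs, credentials and the project key are placeholders.
import sys
import requests

JIRA_URL = "https://jira.example.com"   # placeholder
AUTH = ("ops-bot", "secret")            # placeholder credentials

def main(argv):
    host, service, state, state_type = argv[1:5]

    # Only act on confirmed failures; ignore soft states and recoveries.
    if state != "CRITICAL" or state_type != "HARD":
        return 0

    payload = {
        "fields": {
            "project": {"key": "OPS"},                  # placeholder project
            "summary": f"{service} CRITICAL on {host}",
            "issuetype": {"name": "Faults"},            # one of the trial's card types
        }
    }
    resp = requests.post(
        f"{JIRA_URL}/rest/api/2/issue", json=payload, auth=AUTH, timeout=10
    )
    resp.raise_for_status()
    return 0

if __name__ == "__main__":
    sys.exit(main(sys.argv))
```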
(cross posted on geekle.id.au)