Bork Night – A Series of Successful Failures!

On Thursday 23rd of February, the Site Operations team held its first Bork Night: an exercise in resilience engineering, in which we introduced faults into our production systems in a controlled manner and responded to them. The senior engineers designed a number of faults and challenged the rest of the team to identify and fix them. It ended up being a lot of fun, and we came away with a good set of learnings we can apply to our systems and processes to do better next time.

Bork Night - The team at work fixing failures

The format of the challenge was:

  1. Teams were assembled (we had two teams of two, and one of three).
  2. The senior engineers set up their faults, and introduced them into the production environment.
  3. A fault was assigned to each team, who then had 10 minutes to evaluate their problem. No attempts to fix were permitted at this time.
  4. Each problem was then handed over from one team to the next, with 2 minutes given for the hand-over.
  5. The next team then had 10 minutes to continue evaluating the problem, building on what the previous team had learned.
  6. There was one more phase of hand-over and evaluation. We then had all the teams compare notes and agree on what each fault was about.
  7. We then let the teams prioritize the faults and regroup however they saw best to fix each problem. This started the React phase. (Originally we planned to rotate the React phase around the teams every 10 minutes, but we changed our approach on the fly.)
  8. Later, we had a debrief over pizza and beer.

Trent running the bork night and setting up the failures

The challenges presented were:

  • Duplication of a server's MAC address, causing it to stop responding. Normally, every server on the network has a unique MAC address so that information can be delivered to it. A new virtual machine image was created with a duplicated MAC address, which confuses the network: packets can no longer be delivered to the correct server, so anything that depends on that server starts failing. We picked on a key database server for this one. Kudos to Gianluca for discovering the cause, enabling a quick recovery by shutting down the offending duplicate machine. (A sketch of how this kind of fault might be detected appears after this list.)
  • A failure of a web server to reboot. A web server (normally serving realestate.com.au) had its boot configuration deliberately deleted and was then shut down, so it could not restart. The machine had to be recovered through its management interface by copying the boot config across from another machine. Congrats to Daniel for correctly speculating on the cause of this one.
  • Forcing several property listing search servers to slow down by making them I/O bound. This fault did not hamper us as badly as we thought it might. On several of the FAST query node servers, which normally power the property searches on REA and RCA, we saturated their ability to read information off the disks. On one hand it was a reassuring surprise that our systems were resilient against this kind of problem; on the other, we realized better ways to introduce this sort of fault in future, such as ensuring the search service has nothing cached in memory first. (A sketch of this kind of fault injection also appears after this list.)
  • And as an extra bonus complication during the event, we deliberately crashed the Nagios monitoring service, so the teams had to re-prioritize their incident response partway through. Kudos to Patrick for figuring out the full extent of what was broken and getting Nagios up and running again.
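
For the curious, here is a minimal sketch of how the duplicate-MAC fault could be detected; this is an illustration I'm adding here, not the tooling we actually used on the night. It scans the local ARP table for any MAC address claimed by more than one IP:

    from collections import defaultdict

    def find_duplicate_macs(arp_path="/proc/net/arp"):
        """Return any MAC address that appears against more than one IP."""
        macs = defaultdict(list)
        with open(arp_path) as f:
            next(f)  # skip the header row
            for line in f:
                fields = line.split()
                ip, mac = fields[0], fields[3]
                if mac != "00:00:00:00:00:00":  # ignore incomplete entries
                    macs[mac].append(ip)
        return {mac: ips for mac, ips in macs.items() if len(ips) > 1}

    if __name__ == "__main__":
        for mac, ips in find_duplicate_macs().items():
            print("duplicate MAC %s claimed by: %s" % (mac, ", ".join(ips)))

And a rough sketch of the I/O saturation fault, again an assumption rather than our actual script: read random blocks from a data file much larger than RAM (the path below is hypothetical), so the page cache cannot absorb the reads and the disk becomes the bottleneck. Random offsets defeat readahead, which addresses the caching problem we noted above.

    import os
    import random

    TARGET = "/var/lib/fast/index.dat"  # hypothetical path to a large data file
    BLOCK = 4096

    def saturate_reads(path):
        """Read random blocks forever, forcing the disk to seek constantly."""
        blocks = os.path.getsize(path) // BLOCK
        with open(path, "rb") as f:
            while True:
                f.seek(random.randrange(blocks) * BLOCK)
                f.read(BLOCK)

    if __name__ == "__main__":
        saturate_reads(TARGET)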

Working through the failures on Bork Night

Several things worked well, and some things we can do better. Our learnings included:

  • Our emergency console and DRAC access to our servers is not smooth, with confusion over which passwords to use and a limit of one user at a time.
  • In future, scenarios should be designed to avoid overlaps, so that no two scenarios affect the same services.
  • Some scenarios need better fault-description information.
  • We need a venue where monitoring dashboards, such as naglite, can be displayed.
  • Wifi access continued to plague us.
  • Outsider insight was a fantastic asset. We had developer participation, and while the developers might not have had technical familiarity with the siteops details, they offered great insights into where to focus the troubleshooting. The next Bork Night really needs to welcome non-siteops participation.

Finally, a big thank you to Tammy for arranging the pizza and drinks!

(Reposting Aaron Wigley's post from the realestate.com.au internal community site)

Kanban in Operations – Virtual Card Wall

Three months ago I joined the Site Operations team at realestate.com.au, and I was pleased to see that the team was using a card wall to manage its work.

Card Wall

Although the physical card wall proved to be a great place to have stand ups and manage work, it had its problems:

  • We have a distributed team. With operations teams in Italy (casa.it) and Luxembourg (athome.lu), people on devops rotations, and people occasionally working from home, it is hard for everyone to participate in stand up.
  • Data associated with cards, such as creation timestamps and creators, is dependent on users writing it on the cards.
  • Limited external visibility into the Site Ops workload. If anyone wanted to know what we were currently working on, they had to head up to the Site Ops area and have a look.

After a discussion with the team, we decided to trial a virtual card wall.

Scope

The trial would run for two weeks, replicating the cards on our physical card wall, with a retrospective and a decision on whether to continue at the end.

The trial would not include capturing incidents or deployments, and would be as light as possible.

Setup

To get the trial up and running as soon as possible, we utilised our existing Jira installation with Greenhopper. The project setup and configuration were kept to a bare minimum.

We created five new issue types, based on the cards on our physical wall – Service Requests, Deployment, Provisioning, Housekeeping and Faults.

Card Types

A week before the trial commenced, we manually imported the cards into Jira and wrote the Jira issue number on the physical cards. During that week we also duplicated any new physical cards into Jira. This allowed us to start tracking behaviour before the trial started.
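
If we ever need to do a bulk import again, it could be scripted rather than done by hand. Here is a minimal sketch of the idea, assuming a Jira version that exposes the /rest/api/2 endpoint; the URL, credentials, project key and card data are all hypothetical:

    import requests

    JIRA = "https://jira.example.com"  # hypothetical Jira URL
    AUTH = ("user", "password")

    cards = [
        {"summary": "Replace failed disk on web03", "type": "Faults"},
        {"summary": "Provision new FAST query node", "type": "Provisioning"},
    ]

    for card in cards:
        payload = {"fields": {
            "project": {"key": "SITEOPS"},  # hypothetical project key
            "summary": card["summary"],
            "issuetype": {"name": card["type"]},
        }}
        r = requests.post(JIRA + "/rest/api/2/issue", json=payload, auth=AUTH)
        r.raise_for_status()
        # Write the returned issue key (e.g. SITEOPS-42) on the physical card.
        print(card["summary"], "->", r.json()["key"])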

Card

Our virtual card wall is still tactile: stand ups are now conducted in front of a Smart Board, which lets us interact with Greenhopper using our fingers as the mouse.

The Trial

The trial kicked off on Friday 8th July at 0900. We had our regular stand up, the only difference being the new virtual card wall.

Stand up

In addition to Greenhopper, we started trialling weekly iterations (versions) in Jira, running Thursday to Thursday.

Although we weren't planning the iterations, the option is there to place a card a few iterations ahead if it won't be actioned for a few weeks.

What works and what doesn’t?

The trial of Greenhopper has been great, and it has identified a few things that work well and some that don't.

  • It's difficult to raise new cards at stand up. This is a change from our regular process, as cards now have to be created and edited before or after stand up. However, it has minimised interruptions, allowing the team to focus during stand up.
  • We are able to raise cards wherever we have access to a web browser; we are no longer constrained to being in the office.
  • For a few of the stand ups we didn't have access to the Smart Board and used a projector instead. It felt awkward: physical interaction with the card wall definitely enhances the experience, and it feels natural for the team to huddle around the card wall rather than a computer.

What’s next?

So what’s next for the Site Operations Greenhopper integration?

  • First up is to extend the trial to the global operations teams, possibly moving our stand up to a more sensible hour for our European colleagues.
  • Next is to increase transparency into the current Site Operations workload. To achieve this we will look into publishing a read-only card wall to the wider company (see the sketch after this list).
  • Start planning work for iterations. We didn't plan beyond one week during the trial, but we are collecting data on how long cards take to cycle through our system.
  • Start estimating card size again. Based on the data collected, we should be able to reliably estimate work and compare the estimates with actual durations.
  • Customise Jira to suit the workflow in Site Operations, including incident management and deployments. This will be an evolutionary process, with the aim of keeping the workflow as light as possible.
  • The final goal is to investigate integration with other operations systems, such as ZenDesk and Nagios. This would minimise duplicated effort and streamline our workflow.
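
As a starting point for that read-only wall, here is a minimal sketch: pull the current cards over Jira's REST search endpoint and print them, which could just as easily render a static page for the wider company. The URL, credentials and JQL are hypothetical, and it assumes a Jira exposing /rest/api/2.

    import requests

    JIRA = "https://jira.example.com"  # hypothetical Jira URL
    AUTH = ("readonly", "password")    # hypothetical read-only account

    resp = requests.get(
        JIRA + "/rest/api/2/search",
        params={
            "jql": "project = SITEOPS AND status != Closed",  # hypothetical JQL
            "fields": "summary,status",
            "maxResults": 100,
        },
        auth=AUTH,
    )
    resp.raise_for_status()

    # Print a plain-text wall; a real version might write static HTML instead.
    for issue in resp.json()["issues"]:
        f = issue["fields"]
        print("[%s] %s: %s" % (f["status"]["name"], issue["key"], f["summary"]))
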
(cross-posted on geekle.id.au)

Your build information on your Android device!

The healthy competition that prevails in the mobile industry today is primarily between iOS and Android, fuelled mainly by the number of applications written for both operating systems by developers across the world. Both have unique features that most applications attempt to exploit. Recently I decided that I should make a start with Android and write an application that I would want to use day in and day out. From this thought process emerged "Blamer".

We are firm believers in Test Driven Development, and one of the main tenets of TDD is quick feedback. With this in mind, I recently wrote the Android application mentioned above.

It connects to your Jenkins CI server and downloads information about your builds. It does not connect to any other CI server yet, but I am working on it!

You can step into individual broken builds and see how, why and when they failed, and if the person whose commit broke the build is in your phone book, you can SMS them about the failure.
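
For a feel of the data involved, here is a rough Python sketch of pulling the same information from Jenkins's remote JSON API; the server URL is hypothetical, and this is a sketch of the idea rather than the app's actual code:

    import requests

    JENKINS = "https://ci.example.com"  # hypothetical Jenkins URL

    # Ask Jenkins for every job's name and status colour in one call.
    jobs = requests.get(JENKINS + "/api/json?tree=jobs[name,color]").json()["jobs"]

    for job in jobs:
        if job["color"].startswith("red"):  # "red" means the last build failed
            build = requests.get(
                JENKINS + "/job/%s/lastBuild/api/json" % job["name"]
            ).json()
            # "culprits" lists the people whose commits went into the broken build.
            culprits = ", ".join(c["fullName"] for c in build.get("culprits", []))
            print("%s broken at build %s (%s)" % (job["name"], build["number"], culprits))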

The application is available on the Android Market, and the source code is freely available on my GitHub account.

Feel free to fork the code and contribute. Any comments and/or feedback will be most welcome.

devops, and associated thoughts!

Here at REA, we are firm believers in accountability for work, from inception to delivery! To cement this idea, we instituted the concept of devops some months ago, where a dedicated ops person would co-locate with the development teams for particular projects and be just like another developer. Conversely, a developer from one of the teams would co-locate with the Operations team and experience being a systems administrator. The aim was to make members of both teams aware of the fine-grained issues and complexities that arise within the processes of development and deployment. We frequently rotated personnel through this process to better spread the lessons learnt. As an observer of this process, I have come to a few conclusions.

1. One person cannot be a ‘devops’ person.

It is not possible to have one person who is expected to do both deployment and development while everyone else is expected to do only one of the two. The responsibility, and that's what it is, has to be shared across the team. It is the responsibility of the entire team, not just one person, to ensure that the delivered product works in production.

2. It is a mentality, not a position.

It should not be marketed as a position to slot one person into, or even to rotate people through. A software developer who writes code should be as concerned about that code working in production as in development. Operations are certainly more experienced with deploying and maintaining products in production-like environments, but that does not mean a developer should make software decisions without being aware of the implications of deploying them to production, or be unaware of the status of his or her code after it passes quality assurance. The thinking behind devops is the critical bit of software development.

3. Operations need to be consulted early and often.

Operations are somehow considered the team to call in last, when your product is ready or almost ready. That, in my humble opinion, is a fallacy. They need to be involved in every inception, every retrospective and every card kick-off, like any other team member. Their thoughts are invaluable because they look at a piece of work not just in terms of whether it can be done, but whether it should be done given the present state of production, and whether there are inherent dependencies that need to be resolved first. Not only will this ensure that software developers start writing software with production in mind rather than their own laptops, it will also spread knowledge of the state of production to people who might not have it to begin with.

devops is not a magical word that resolves our issues of deployment, dependencies and so on, but it does start us down the road of understanding that we are deploying to a production box, not our own laptops, and that it is the responsibility of everyone involved, not just one lone person, to make sure that happens.

Innovation Day, March 2011.

At realestate.com.au, we are committed to several core values that drive our business and keep us ahead in the market, both in terms of the business value we deliver and the quality of our products.

One of those core values is innovation. As a company, we are committed to innovation, and we provide an avenue for the entire business to take some time during business hours to experiment with our products, processes, or even the way we think about them, and turn that into something different. As a means of realising this, we have instituted "Innovation Day": a day and a half every quarter to do what you will with existing or new products, processes or tools, and turn them into something interesting and/or useful for the business.

Recently, we held our first Innovation Day for this year, on the 31st of March and 1st of April, 2011. Participation was vigorous, and the Q&A sessions following each presentation were in-depth and illuminating. One of the most pleasing aspects was how well thought out the future scope was for each idea. Hopefully, we will soon be able to put these ideas into business practice.

Some pictures from the event:

Thanks to all the people who participated in this event and to all those who facilitated it.

Special thanks to Sam Weller and Mike Breeze for organising one of the best innovation days this company has had.