
Our journey from Ruby 1.8.7 to 1.9.3

23rd Aug 2013 — Read in 8 mins

Recently REA made the move from Ruby 1.8.7 to Ruby 1.9.3 for our listing administration tool, a large Rails application used by real estate agents to manage their listings. The endeavour was ultimately successful, but not without significant challenges.

The most notable of these was a nasty segmentation fault. At one stage it caused so much pain that we felt we had to share our discoveries, in the hope that we might be able to ease the pain for someone else.

Background

Like all projects, the upgrade had a number of inherent challenges and restrictions. We had to fit it in between major project priorities. We also had to make sure that one of our shared libraries, which describes common domain objects for several other internal Rails applications, maintained compatibility with 1.8.7. This effectively meant that the build pipeline needed to create artifacts for both 1.8.7 and 1.9.3, and the source had to be compatible with both.

Initially we discussed the possibility of moving from 1.8.7 directly to 2.0, but after some investigation we decided to pull back from this path. Amongst other things, Phusion’s recommendation against using Passenger on Ruby 2.0 at that point in time was a major concern for us. Passenger is one of our core infrastructure pieces for production.

We also noticed several reports of segmentation faults on 2.0, which we wanted to avoid. (I guess you could say that hindsight is a wonderful thing.) Our agreed approach was to move to 1.9.3, knowing that the future step from 1.9.3 to 2.0 would be a much simpler one.

Moving shared domain library to 1.9.3

The first step was to move the shared domain component to 1.9.3 in a backward-compatible way. As previously mentioned, this component is shared by several applications, some of which are still running on 1.8.7. To enable this, we added two separate build pipelines to the mix: one for 1.8.7 and one for 1.9.3. We also included a simple syntax-checking component, so that code written by a developer running 1.9.3 locally would still parse under 1.8.7.
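The post does not show the syntax checker itself, so the following is only a minimal sketch of how such a check could look. It relies on the interpreter's `-c` flag, which parses a file without running it. The method name `syntax_errors` and the idea of passing an interpreter path are assumptions made for illustration.

```ruby
require 'open3'
require 'rbconfig'

# Returns the files that the given Ruby interpreter cannot parse.
# `interpreter` is an assumption: point it at the 1.8.7 binary to check
# backward compatibility. It defaults to the interpreter running this script.
def syntax_errors(files, interpreter = RbConfig.ruby)
  files.reject do |file|
    # `ruby -c` only checks syntax; the file is never executed.
    _output, status = Open3.capture2e(interpreter, '-c', file)
    status.success?
  end
end
```

A build step could then fail whenever `syntax_errors(Dir['lib/**/*.rb'], path_to_187_ruby)` returns a non-empty list, where `path_to_187_ruby` is a hypothetical variable holding the location of the 1.8.7 interpreter.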

We used RVM to manage Ruby versions across environments. By defining a few small shell functions, we made it easy to switch between them:

function use_187 {
  rvm --create use ree-1.8.7@cp-domain
}

function use_19 {
  rvm --create use ruby-1.9.3-p392@cp-domain
}

function use_2 {
  rvm --create use ruby-2.0@cp-domain
}

This made it much easier to test code changes against multiple Ruby versions locally. The domain library upgrade went very smoothly, so smoothly that it lulled us into a false sense of security.

Upgrading Agent Administration Tool to Ruby 1.9.3

Updating the agent administration component to use 1.9.3 proved much more difficult than updating the domain library. After moving all of the code over and getting most of our unit tests to pass, we hit our first major hurdle: the new character encoding behavior that came with 1.9.3.

For those who have not previously encountered this, there is a subtle but substantial change in the way Ruby handles strings between 1.8.7 and 1.9.3. Previously, all character strings were stored in a common ‘binary’ format, but as of the 1.9 series each string stores both the bytes representing the data and a marker describing its character encoding. This enables developers to work with Unicode and other non-Latin character sets.
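A small illustration of that change, runnable on Ruby 1.9 or later. The sample string is made up for the example; the point is that the same bytes are counted differently depending on the encoding marker attached to them.

```ruby
text = "caf\u00e9"   # "cafe" with an accented e, written as a Unicode escape

puts text.encoding   # the encoding marker attached to the string (UTF-8 here)
puts text.length     # 4 characters
puts text.bytesize   # 5 bytes, because the accented e takes two bytes in UTF-8

# Relabelling the same bytes as binary changes how Ruby counts them,
# which is roughly how every string behaved under 1.8.7.
raw = text.dup.force_encoding('BINARY')
puts raw.length      # 5, one "character" per byte
```

Note that `force_encoding` only changes the marker and leaves the bytes untouched, whereas `encode` converts the bytes themselves.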

Unfortunately, our MySQL database was encoded with latin1. If we needed to store a special character which was not part of the latin1 character set, for example an ellipsis symbol (that hurt! A three-byte UTF-8 character in a one-byte character set!), then the character was pushed through to MySQL using whatever byte sequence Ruby used to store it internally. This worked because neither Ruby nor MySQL was pedantic about encoding.

The database encoding was set in stone. We were not permitted to re-encode it because numerous other applications assumed they could read and write latin1. However, the new 1.9.3 runtime required that any special characters from our application be encoded using a character set with explicit support for them, i.e. UTF-8.

We tried a few different options. Storing UTF-8 strings directly in MySQL caused other applications to be unable to read them or query their contents. Re-encoding Ruby UTF-8 strings as latin1 caused Ruby errors for the ‘illegal’ characters. Adding alternative transcoding rules broke on multi-byte characters. Stripping special characters broke numerous business requirements.
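To make the re-encoding failure concrete, here is a minimal sketch using Ruby's own `ISO-8859-1` encoding as a stand-in for the database's latin1 (an assumption; the post does not say which Ruby encoding name was used). Strict transcoding raises, and the lossy alternative changes the data.

```ruby
ellipsis = "\u2026"   # HORIZONTAL ELLIPSIS, three bytes in UTF-8

# Strict transcoding: ISO-8859-1 has no ellipsis character, so Ruby raises.
begin
  ellipsis.encode('ISO-8859-1')
rescue Encoding::UndefinedConversionError => e
  puts "cannot transcode: #{e.message}"
end

# Lossy transcoding: substitute a placeholder for anything unrepresentable.
# This avoids the error, but the stored text no longer matches the input.
replaced = "Wait#{ellipsis}".encode('ISO-8859-1', :undef => :replace, :replace => '...')
puts replaced
```

The second form shows why substituting or stripping characters was not acceptable for us: the error goes away, but so does the original content.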

In the end, we found a configuration tweak which allowed us to trick the MySQL drivers into behaving the same way they had in the earlier Ruby version. It was not quite what we would have liked, but it worked well enough for us to consider the issue ‘solved’.

Segmentation Fault

We were on the home straight, all our tests were green, we had our beers cooling in the fridge, and then all of a sudden, bang!

Segmentation fault.

Segmentation faults are an order of magnitude more severe than failing tests. They are bad news in general, but ours was especially problematic. We could reproduce the problem by running our entire suite of cucumber tests, but each run failed at a different place in the code. Adjusting our garbage collection settings affected the frequency of the segmentation faults, but nothing we tried could completely prevent the problem.
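One common way to provoke garbage-collection-related crashes sooner is Ruby's built-in `GC.stress` switch. The post does not say whether this particular setting was among those we adjusted, so treat the sketch below as an illustration of the general technique; `with_gc_stress` is a hypothetical helper name.

```ruby
# Runs the block with GC.stress enabled, then restores the previous setting.
# With GC.stress on, Ruby runs the garbage collector at every opportunity,
# so memory bugs in native extensions tend to surface sooner.
def with_gc_stress
  previous = GC.stress
  GC.stress = true
  yield
ensure
  GC.stress = previous
end

# Hypothetical usage: wrap a small piece of suspect code rather than a whole
# suite, because everything inside the block runs far more slowly.
result = with_gc_stress { Array.new(100) { |i| i.to_s }.join(',').length }
puts result
```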

Having tried all the obvious contenders, we turned to Google in the hope of finding a quick solution. There were dozens of reports of segmentation faults, but none of them quite matched our symptoms. Nothing led us to a clear solution.

The most common issue reported online was a fault caused by nokogiri and its dependency libxml2. We spent days convinced that nokogiri was the culprit, but made no progress. Doubly frustrating, any time we asked colleagues from other teams if they had suggestions, their first thought was always ‘it’s probably nokogiri’.

By this stage all the members of the team had come to the conclusion that we were dealing with a serious problem. So we decided to ‘Swarm’. Every member of the team dropped what they were working on to try and resolve this issue.

Lots of ideas were tabled, lots of strategies proposed, and our most promising leads pointed to garbage collection and native libraries… But which ones?

The biggest part of the issue we faced was the fact that the fault occurred randomly. Sometimes the tests would pass completely, other times you could get faults occurring for three build runs in a row.

This was a hard problem to solve. We initially tried to fix it based on the evidence we had observed, then moved to trying to make the failures more consistent and specific. Neither approach produced results.

Here is a quick summary of the things we tried:

Upgrading Gems:

Upgrading nokogiri to 1.6.0
Upgrading selenium-webdriver, capybara, cucumber and cucumber-rails
Upgrading firefox to latest selenium compatible version
Removing suspicious dev gems (like pry, beagle, guard, guard-rspec etc)
Removing libv8
Upgrading various other gems to the latest version
Running without rake to exclude it

Upgrading platform dependencies:

libxml2, libxslt, libiconv, zlib

Environment changes:

Split cucumber from rails rack-test and ran it in a different process (on the first run cucumber crashed, on the second the rails app crashed)
Alternative rails app servers (Thin, Webrick)
Adjusted environmental ulimits (both stack size and open file descriptors limit)
Analyzed heap dumps and crash logs
Analyzed crashes with dtruss on mac and strace on linux (not much information: different system calls each time before it crashes)

Miscellaneous:

Checked if app crashes on installed boxes (running passenger)
Checked if app segfaults when run as a different process

At this stage we were close to declaring failure and postponing the whole endeavor indefinitely. We were out of good ideas, and weren’t all that enthusiastic about trying out our bad ideas. We were feeling a little sorry for ourselves.

Calling in support from other teams

At this very late stage we decided to make a more formal call for help to our technical peers within the company. There are a lot of really smart people in our company, and we hoped that they might offer some ideas on how we could move forward on this issue. ‘Fresh eyes’ and the like.

Out of the meeting we got some really valuable suggestions:

Reducing the surface area (Reduce the number of gems & dependencies & retry)
Get the list of native gems & find if there were any bug reports filed against them.
Find a way to reproduce the errors quicker
Get help from somebody in the greater Ruby community (core committers to rails or ruby)
Keep testing with 1.9 and still continue in prod with 1.8.7
Observe the code in a prod/simulated environment with a couple of nodes running on 1.9.
Check cucumber options
Try updating the mysql gem
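The suggestion to list native gems can be sketched with the RubyGems API. This is an illustration rather than the script we used: it reports every installed gem that declares a native extension, not only those in the application's bundle, and `native_gems` is a hypothetical helper name.

```ruby
require 'rubygems'

# Names and versions of installed gems that build a native extension.
# Gem::Specification#extensions lists the extension build files
# (typically extconf.rb) that a gem compiles at install time.
def native_gems
  Gem::Specification.select { |spec| !spec.extensions.empty? }
                    .map { |spec| "#{spec.name} #{spec.version}" }
                    .sort
end

puts native_gems
```

Each entry in the output is then a candidate to look up in the gem's issue tracker for known crashes on the Ruby version in use.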

So with this renewed set of ideas we went back to the trenches, agreeing to swarm for one more day before cutting our losses and shelving the migration for a while.

The silver bullet?

At the 11th hour, one of our colleagues found something very interesting. He was convinced this issue had something to do with one of our low-level compiled gems, so he had been trawling through GitHub. He managed to find a possible solution, based on this post.

He was able to create a stand-alone test case which triggered the “segfault” every time, something we hadn’t been able to do.

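require 'xml' # provided by the libxml-ruby gem

# Repeatedly detach the first child and re-attach it after the last one.
loop do
  doc = XML::Document.string('<foo><bar/><baz/></foo>')
  node = doc.root.first.remove!   # detach <bar/> from the document
  doc.root.last.next = node       # re-insert it as the sibling after <baz/>
end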

There was a known issue with libxml-ruby, which would cause a segmentation fault under ruby 1.9.3p392 (2013-02-22 revision 39386) [x86_64-linux], exactly the version we were running. We were onto a winner!

He noticed that the shared domain library was using both the libxml-ruby gem and the sax-machine gem, each of which linked through to libxml2. We had introduced both gems into the project in response to a past performance issue during start-up of the application. By removing the libxml-ruby gem from the project and testing the application a few times, he was able to confirm that our normal steps no longer reproduced the segmentation fault. Not yet conclusive, but promising!

Gaining Confidence

One concerning aspect was how to prove that the issue was actually fixed, not just improved. We didn’t want to release an artifact that could “segfault” in production randomly, even if it only happened rarely. We needed to reach a level of confidence that the problem was solved.

We decided to take advantage of AWS computing power. To the cloud!!! Given that our organisation is a pretty mature AWS user, we were easily able to set up extra-large AWS nodes running our application. We set up two nodes running our cucumber test suite in an infinite loop. Given that a full cucumber run takes us ~40 minutes, we were able to execute (36*4 = 144) test runs over 24 hours without any failures. We were now pretty confident that our problem was solved.
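The looping itself needs very little code. The sketch below is not the script we ran on the nodes; it is a minimal illustration of the idea, with a hypothetical `soak` helper and a trivial stand-in command where the real loop would invoke the cucumber suite.

```ruby
require 'rbconfig'

# Runs `command` up to `max_runs` times and stops at the first run that
# does not exit cleanly. Returns the number of clean runs and the failing
# Process::Status (nil when every run passed).
def soak(command, max_runs)
  max_runs.times do |completed|
    system(*command)
    status = $?
    # A crashed process is usually terminated by a signal instead of
    # exiting normally; success? is not true in that case either.
    return [completed, status] unless status.success?
  end
  [max_runs, nil]
end

# Stand-in command for illustration only.
passes, failure = soak([RbConfig.ruby, '-e', 'exit 0'], 3)
puts "#{passes} clean runs, failure: #{failure.inspect}"
```

Stopping at the first failure keeps the crash output at the end of the log, which matters when a single run takes around 40 minutes.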

We made it, lessons learned

We made it. Although this experience was quite painful for the team, we have definitely learnt a lot. If we had our time again, we would definitely do the following:

Pair swap often on these upgrade tasks
Find a way to consistently replicate the issue with a quick turnaround time
Swarm early to put focus on an issue and solve it before it drags on
Get advice and ideas from as many people as you can. Put pride aside and don’t be scared to ask for help
Take advantage of cloud computing power to run automated tests in anger to gain confidence that the problem has been solved

Written by the AD2 Team

Adam Tohovitis
