Hunting Down Memory Issues in Rails

On a Friday a few weeks ago, we deployed a set of minor changes to one of our Rails apps. That evening, our servers started alerting on memory usage (> 95%). Our initial attempts to remedy this situation by reducing the Puma worker count on each EC2 instance didn’t help, and the memory usage remained uncomfortably high through the weekend. On Monday, we popped open NewRelic and had a look in the Ruby VM section. Indeed, both the Ruby heap and memory usage of each web worker process had begun a fairly sharp climb when we deployed on Friday, after being totally flat previously:

Heap Size Growing

However, over the same period of time, the number of objects allocated in each request remained fairly static:

Object Allocations Over the Same Period

If our requests aren’t creating more objects, but there are more and more objects in memory over time, some of them must be escaping garbage collection somehow.

A Wild Goose Chase

At this point it was quite clear that something in the code we’d deployed was causing our web workers to consume an ever-increasing amount of memory. Unfortunately, we’d spent the week piling up changes in staging for QA, but hadn’t deployed to production in quite some time. As a result, the range of commits that could have been responsible was quite large. This is a compelling case for frequent (if not continuous!) production deployment.

Nonetheless, our first target for tracking down the leak was the git log. This turned out to be a bit of a dead end. None of the code was using any kind of global state that might have been responsible for leaking object references. However, I was initially suspicious of a circular reference in some of our presenters. Here’s a simplified version of the code:

class PropertyPresenter
  def description
    @description ||= PropertyDescriptionPresenter.new(self)
  end
end

class PropertyDescriptionPresenter
  def initialize(property_presenter)
    @property_presenter = property_presenter
  end
end

I thought maybe that Ruby’s garbage collector was confused since each of these objects was holding on to a reference to the other. This would be a problem for very simple reference-counting GC, but Ruby’s mark-and-sweep/generational GC handles circular references just fine. I was able to verify this with a simple script using the memory_profiler gem as outlined in the second part of this post by Sam Saffron.

This showed the object count flat, with very minor variations (both up and down) in each cycle. So much for that idea.

Profiling the heap

At this point we decided to bring out the big guns: dumping Ruby’s heap and analyzing it. As of Ruby 2.1, the ObjectSpace class gives you the ability to dump Ruby’s heap. This dump contains an enormous amount of data, down to information about which line of which file was responsible for allocating every object currently in memory. In order to get a dump, you need to first enable tracing of object allocations. This makes everything slower, so it’s best to avoid turning it on in production if you can help it. In our case, we wanted to first try to get snapshots from a server running locally on a laptop, so this was fine.

require 'objspace'
ObjectSpace.trace_object_allocations_start

You could put this in boot.rb or any similar app startup code. I just made a new Rails initializer and put it in there.

Now that you’re tracing object allocations, you need to make a call to actually dump the heap. Various articles we read suggested using rbtrace to attach to the running server process to do this, but since I was just going to be testing this locally (for now), I just added a controller method so I could hit an endpoint in my browser to dump the heap to a file:

class PropertiesController
  def dump
    io = File.open("/tmp/heap_#{params[:id]}", "w")
    ObjectSpace.dump_all(output: io)
    io.close
  end
end

With this in place I can visit /dump?id=1 in my running server to get a snapshot of its heap saved to /tmp/heap_1.

Then I fired up ruby_jmeter and threw a bunch of traffic at my local server’s property pages, using a CSV file of actual URLs generated from our staging database to ensure I was hitting a diverse range of pages. I took three heap snapshots, about 10 minutes apart (ideally you’d want to leave it longer to give your leak time to manifest itself, but this was a first attempt to verify that the whole thing worked).

Why three snapshots? As per this blog post:

The goal here is to get a list of all objects that were added between the first and second snapshot and were still present in the third snapshot. This isn’t a guarantee that those objects won’t get cleaned up even later, but it at least tells us we didn’t catch any very short lived objects.

That post also provides a script to diff the snapshots, which I encourage you to use. Here’s the top 20 lines of output it produced for my heap dumps:

Leaked 22444 STRING objects of size 533470/3263282 at: /Users/louis.simoneau/.gem/ruby/2.2.0/gems/json-1.8.3/lib/json/common.rb:155
Leaked 17699 STRING objects of size 117740/735942 at: /Users/louis.simoneau/.gem/ruby/2.2.0/gems/activerecord-4.2.2/lib/active_record/connection_adapters/postgresql/database_statements.rb:168
Leaked 10596 STRING objects of size 114667/423840 at: /Users/louis.simoneau/.gem/ruby/2.2.0/gems/activerecord-4.2.2/lib/active_record/result.rb:110
Leaked 6402 STRING objects of size 74967/262034 at: /Users/louis.simoneau/.gem/ruby/2.2.0/gems/activerecord-4.2.2/lib/active_record/connection_adapters/postgresql/database_statements.rb:162
Leaked 6139 OBJECT objects of size 0/439768 at: /Users/louis.simoneau/.gem/ruby/2.2.0/gems/activerecord-4.2.2/lib/active_record/attribute.rb:5
Leaked 5455 STRING objects of size 31916/235392 at: /Users/louis.simoneau/.gem/ruby/2.2.0/gems/activerecord-4.2.2/lib/active_record/type/string.rb:35
Leaked 2784 HASH objects of size 0/773952 at: /Users/louis.simoneau/.gem/ruby/2.2.0/gems/json-1.8.3/lib/json/common.rb:155
Leaked 1245 STRING objects of size 21663/60979 at: /Users/louis.simoneau/.gem/ruby/2.2.0/gems/puma-2.11.3/lib/puma/client.rb:168
Leaked 1229 OBJECT objects of size 0/107952 at: /Users/louis.simoneau/.gem/ruby/2.2.0/gems/newrelic_rpm-3.12.1.298/lib/new_relic/agent/transaction/trace.rb:53
Leaked 1041 OBJECT objects of size 0/91608 at: /Users/louis.simoneau/.gem/ruby/2.2.0/gems/newrelic_rpm-3.12.1.298/lib/new_relic/agent/transaction_metrics.rb:13
Leaked 938 ARRAY objects of size 0/45840 at: /Users/louis.simoneau/.gem/ruby/2.2.0/gems/newrelic_rpm-3.12.1.298/lib/new_relic/agent/transaction/trace_node.rb:31
Leaked 913 OBJECT objects of size 0/65736 at: /Users/louis.simoneau/.gem/ruby/2.2.0/gems/newrelic_rpm-3.12.1.298/lib/new_relic/agent/traced_method_stack.rb:35
Leaked 778 STRING objects of size 32035/79365 at: /Users/louis.simoneau/.gem/ruby/2.2.0/gems/newrelic_rpm-3.12.1.298/lib/new_relic/agent/transaction_metrics.rb:13
Leaked 767 STRING objects of size 78970/128397 at: /Users/louis.simoneau/.gem/ruby/2.2.0/gems/sprockets-3.2.0/lib/sprockets/path_utils.rb:17
Leaked 763 ARRAY objects of size 0/30520 at: /Users/louis.simoneau/.gem/ruby/2.2.0/gems/activerecord-4.2.2/lib/active_record/connection_adapters/postgresql_adapter.rb:438
Leaked 760 ARRAY objects of size 0/30400 at: /Users/louis.simoneau/.gem/ruby/2.2.0/gems/activerecord-4.2.2/lib/active_record/connection_adapters/postgresql_adapter.rb:434
Leaked 754 DATA objects of size 0/96416 at: /Users/louis.simoneau/.gem/ruby/2.2.0/gems/activerecord-4.2.2/lib/active_record/connection_adapters/postgresql_adapter.rb:438
Leaked 630 ARRAY objects of size 0/177264 at: /Users/louis.simoneau/.gem/ruby/2.2.0/gems/activerecord-4.2.2/lib/active_record/connection_adapters/postgresql/database_statements.rb:168
Leaked 573 STRING objects of size 1895/22920 at: /Users/louis.simoneau/.gem/ruby/2.2.0/gems/activesupport-4.2.2/lib/active_support/core_ext/numeric/conversions.rb:131
Leaked 564 STRING objects of size 3353/22560 at: /Users/louis.simoneau/.gem/ruby/2.2.0/gems/newrelic_rpm-3.12.1.298/lib/new_relic/agent/instrumentation/action_view_subscriber.rb:77

It’s certainly quite dense, but what’s the upshot? Most of the top sources of objects persisting between snapshots were created by ActiveRecord, the NewRelic agent, the JSON gem, or the Puma web server. I don’t have any particular reason to suspect any of these sources, as the versions hadn’t changed with the deploy that caused the memory to spike. None of the top 20 were files in our app, which was a bit discouraging.

One line did stand out as unlike the others, though:

Leaked 767 STRING objects of size 78970/128397 at: /Users/louis.simoneau/.gem/ruby/2.2.0/gems/sprockets-3.2.0/lib/sprockets/path_utils.rb:17

For those unaware, sprockets is Rails’ asset compilation and packaging library, used as the backend for the asset pipeline. This stood out for two reasons: firstly, sprockets is used to compile assets, which happens as part of deployment. Why would the running application be generating so many sprockets path strings? Secondly, I knew that one of the changes we’d made was using a feature of the asset pipeline in a somewhat unconventional way.

The Smoking Gun

The app makes use of a few strategies to show feature images on property pages. First, we check to see if there’s been a listing on realestate.com.au for the property. If so, we’ll use the listing image. Otherwise, we default to Google StreetView. About a week earlier, a requirement came in that we should be able to override certain property images manually in exceptional extenuating circumstances. This feature was implemented by adding the images to the app repo with a name matching the property’s URL slug, and checking for their existence using Rails’ built-in asset helpers:

<% if Rails.application.assets.find_asset("properties/#{@property.rea_address_uri}.png") %>
  <div class="main-image"><%= image_tag("properties/#{@property.rea_address_uri}.png") %></div>
<% else %>
  <div id="streetview" class="main-image streetview"></div>
<% end %>

When I saw the Sprockets line in the heap diff, I immediately thought back to this code. What does find_asset actually do? Let’s look at the source in the docs:

Docs for find_asset

“Cached”, you say! It turns out that every lookup for an asset goes through a cache holding a hash of filenames and their status (or nil if the file doesn’t exist). The issue arises because we’re looking for a .png file with a name matching the URL slug of each property … and there are roughly 11.5 million properties. Assuming 50 bytes for each string key (which is quite conservative), that gives us a hash somewhere in the vicinity of 500MB per worker process once it’s served up every page on the site. Thus, each worker process will gradually grow in size, putting on a few bytes every time it serves up a new page. With 10 worker processes on a machine with 4GB of RAM, well…

The Fix

We implemented our own tiny image store, which is populated at app startup only with the images we have. The result:

All Systems Nominal

Bam!

Conclusions

This wasn’t really a memory leak, and it would certainly have been possible to track it down without resorting to any profiling tools. This would have been made considerably simpler if the number of changes that had gone into the release in question was very low. If the only thing that had changed was this custom photo lookup, we probably would have checked the docs or source of find_asset quite a bit sooner, and come to the right conclusion without needing to resort to memory profiling or the like. So the first take home lesson is an old one: release as often as you possibly can!

The second lesson: don’t get attached to your hypotheses. The circular reference idea was the only thing that looked even remotely leaky in the code diff, but I probably spent a bit more time hunting it down than I should have. When you don’t have data, don’t spend too long chasing your ideas instead of getting data!

If there were to be a third lesson, it might be about being wary of using libraries for purposes they weren’t designed for. The sprockets filesystem cache makes perfect sense given most applications will have a few dozen (or at most a few hundred) static assets. The images we wanted to lookup weren’t app assets, they were app data. The feature was implemented extremely quickly because it was seen to be a one-off special case and not core to the app’s function. Everyone on the team who saw the code at the time said something to the effect of “huh, that’s a clever trick!” But while some good engineering involves clever tricks, not all clever tricks are good engineering!

  • Sp4cecat

    Also: Don’t deploy on Friday ..

  • Swapnil Abnave

    When I add

    require ‘objspace’

    ObjectSpace.trace_object_allocations_start

    in boot.rb(or any initializer) I get a long trace of error ending with a note.

    [NOTE]

    You may have encountered a bug in the Ruby interpreter or extension libraries.

    Bug reports are welcome.

    For details: http://www.ruby-lang.org/bugreport.html