Reputation: 2411
I am using Rails 3.1.1 and deploying on Heroku. I am using open-uri and Nokogiri.
I am trying to troubleshoot a memory leak (?) that occurs while fetching and parsing an XML file. The XML feed I am fetching and trying to parse is 32 MB.
I am using the following code for it:
require 'open-uri'
open_uri_fetched = open(feed.fetch_url)
xml_list = Nokogiri::HTML(open_uri_fetched)
where feed.fetch_url is the URL of an external XML file.
It seems that while parsing xml_list with Nokogiri (the last line of my code), memory usage explodes to 540 MB and keeps climbing. That doesn't seem logical, since the XML file is only 32 MB.
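For reference, a crude way I can watch the memory from inside the process is to poll its resident set size (RSS) with ps (Unix-only), something like:

require 'open-uri'

# Resident set size of this process in MB (shells out to ps, Unix-only)
def rss_mb
  `ps -o rss= -p #{Process.pid}`.to_i / 1024
end

open_uri_fetched = open(feed.fetch_url)
puts "before parse: #{rss_mb} MB"
xml_list = Nokogiri::HTML(open_uri_fetched)
puts "after parse:  #{rss_mb} MB"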
I have looked all over for ways to analyze this better (e.g. ruby/ruby on rails memory leak detection) but I can't understand how to use any of them. MemoryLogic seems simple enough, but its installation instructions seem to lack some info...
So, please help me either determine whether the code above really should use that much memory, or give me (super simple) instructions on how to find the memory leak.
Thanks in advance!
Upvotes: 1
Views: 992
Reputation: 84182
Parsing a large XML file and turning it into a document tree will in general create an in-memory representation that is far larger than the XML data itself. Consider, for example
<foo attr="b" />
which is only 16 bytes long (assuming a single-byte character encoding). The in-memory representation of this document will include an object for the element itself, probably an (empty) collection of children, and a collection of attributes for that element containing at least one entry. The element itself has properties like its name, its namespace, pointers to its parent document, and so on. The data structure for each of those things is probably going to be over 16 bytes, even before they're wrapped in Ruby objects by Nokogiri (each of which has a memory footprint that is almost certainly >= 16 bytes).
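To make that concrete, here is a quick irb sketch of a few of the objects Nokogiri hands back for that one 16-byte element:

require 'nokogiri'

doc  = Nokogiri::XML('<foo attr="b" />')
node = doc.root

node.name       # => "foo"
node.attributes # => { "attr" => #<Nokogiri::XML::Attr ...> }
node.children   # => an (empty) NodeSet, still an allocated object
node.document   # back-pointer to the owning document

Multiply that overhead across every element, attribute, and text node in a 32 MB feed and a few hundred MB is not surprising.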
If you're parsing large XML files you almost certainly want to use an event-driven parser, like a SAX parser, that responds to elements as they are encountered in the document, rather than building a tree representation of the entire document and then working on that.
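For example, here is a minimal sketch of a Nokogiri SAX handler; the 'title' element name is made up, so swap in whatever you actually need from your feed:

require 'nokogiri'
require 'open-uri'

# Receives events as the parser walks the stream; only the text of
# the element currently being read is held in memory.
class FeedHandler < Nokogiri::XML::SAX::Document
  def start_element(name, attrs = [])
    @buffer = '' if name == 'title' # 'title' is a hypothetical element name
  end

  def characters(string)
    @buffer << string if @buffer
  end

  def end_element(name)
    return unless name == 'title' && @buffer
    puts @buffer # or save it to the database, etc.
    @buffer = nil
  end
end

parser = Nokogiri::XML::SAX::Parser.new(FeedHandler.new)
parser.parse(open(feed.fetch_url)) # parse accepts an IO as well as a string

Peak memory then tracks the largest individual text node rather than the whole document.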
Upvotes: 2
Reputation: 9577
Are you sure you aren't running up against the upper limits of what Heroku allows for 'long running tasks'?
I've timed out and had stuff just fail on me all the time due to some of the restrictions Heroku puts on the free tier.
I mean, can you replicate this in your dev environment? How long does it take on your machine to do what you want?
EDIT 1:
Also, what is this, by the way?
open_uri_fetched = open(feed.fetch_url)
Where is the URL it is fetching? Does it bork there or on the actual Nokogiri call? How long does this fetch take, anyway?
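If you want to narrow it down, here's a quick sketch that times the fetch and the parse separately (Benchmark is in the standard library):

require 'open-uri'
require 'benchmark'

fetched = nil
doc     = nil

# Time the network fetch and the Nokogiri parse independently
fetch_time = Benchmark.realtime { fetched = open(feed.fetch_url) }
parse_time = Benchmark.realtime { doc = Nokogiri::HTML(fetched) }

puts format("fetch: %.2fs, parse: %.2fs", fetch_time, parse_time)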
Upvotes: 1