Reputation: 2411
I am using Rails 3.1.1 and deploying on Heroku. I am using open-uri and Nokogiri.
I am trying to troubleshoot a memory leak (?) that occurs while fetching and parsing an XML file. The XML feed I am fetching and trying to parse is 32 MB.
I am using the following code for it:
require 'open-uri'
open_uri_fetched = open(feed.fetch_url)
xml_list = Nokogiri::HTML(open_uri_fetched)
where feed.fetch_url is the URL of an external XML file.
It seems that while parsing xml_list with Nokogiri (the last line of my code), memory usage explodes to 540 MB and keeps climbing. That doesn't seem logical, since the XML file is only 32 MB.
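For reference, a crude way I can watch the memory from inside the process is to poll its resident set size (RSS) with ps (Unix-only), something like:

require 'open-uri'

# Resident set size of this process in MB (shells out to ps, Unix-only)
def rss_mb
  `ps -o rss= -p #{Process.pid}`.to_i / 1024
end

open_uri_fetched = open(feed.fetch_url)
puts "before parse: #{rss_mb} MB"
xml_list = Nokogiri::HTML(open_uri_fetched)
puts "after parse:  #{rss_mb} MB"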
I have looked all over for ways to analyze this better (e.g. ruby/ruby on rails memory leak detection) but I can't understand how to use any of them. MemoryLogic seems simple enough, but its installation instructions seem to lack some info...
So, please help me either determine whether the code above really should use that much memory, or give me (super simple) instructions on how to find the memory leak.
Thanks in advance!
Upvotes: 1
Views: 992
Reputation: 84182
Parsing a large XML file and turning it into a document tree will in general create an in-memory representation that is far larger than the XML data itself. Consider, for example
<foo attr="b" />
which is only 16 bytes long (assuming a single-byte character encoding). The in-memory representation of this document will include an object for the element itself, probably an (empty) collection of children, and a collection of attributes for that element containing at least one entry. The element itself has properties like its name, its namespace, pointers to its parent document, and so on. The data structure for each of those things is probably going to be over 16 bytes, even before they're wrapped in Ruby objects by Nokogiri (each of which has a memory footprint that is almost certainly >= 16 bytes).
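To make that concrete, here is a quick irb sketch of a few of the objects Nokogiri hands back for that one 16-byte element:

require 'nokogiri'

doc  = Nokogiri::XML('<foo attr="b" />')
node = doc.root

node.name       # => "foo"
node.attributes # => { "attr" => #<Nokogiri::XML::Attr ...> }
node.children   # => an (empty) NodeSet, still an allocated object
node.document   # back-pointer to the owning document

Multiply that overhead across every element, attribute, and text node in a 32 MB feed and a few hundred MB is not surprising.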
If you're parsing large XML files you almost certainly want to use an event-driven parser, like a SAX parser, that responds to elements as they are encountered in the document, rather than building a tree representation of the entire document and then working on that.
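For example, here is a minimal sketch of a Nokogiri SAX handler; the 'title' element name is made up, so swap in whatever you actually need from your feed:

require 'nokogiri'
require 'open-uri'

# Receives events as the parser walks the stream; only the text of
# the element currently being read is held in memory.
class FeedHandler < Nokogiri::XML::SAX::Document
  def start_element(name, attrs = [])
    @buffer = '' if name == 'title' # 'title' is a hypothetical element name
  end

  def characters(string)
    @buffer << string if @buffer
  end

  def end_element(name)
    return unless name == 'title' && @buffer
    puts @buffer # or save it to the database, etc.
    @buffer = nil
  end
end

parser = Nokogiri::XML::SAX::Parser.new(FeedHandler.new)
parser.parse(open(feed.fetch_url)) # parse accepts an IO as well as a string

Peak memory then tracks the largest individual text node rather than the whole document.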
Upvotes: 2
Reputation: 9577
Are you sure you aren't running up against the upper limits of what Heroku allows for 'long running tasks'?
I've timed out and had stuff just fail on me all the time due to some of the restrictions Heroku puts on the free tier.
I mean, can you replicate this in your dev environment? How long does it take on your machine to do what you want?
EDIT 1:
Also, what is this, by the way?
open_uri_fetched = open(feed.fetch_url)
Where is the URL it is fetching? Does it bork there or on the actual Nokogiri call? How long does this fetch take, anyway?
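If you want to narrow it down, here's a quick sketch that times the fetch and the parse separately (Benchmark is in the standard library):

require 'open-uri'
require 'benchmark'

fetched = nil
doc     = nil

# Time the network fetch and the Nokogiri parse independently
fetch_time = Benchmark.realtime { fetched = open(feed.fetch_url) }
parse_time = Benchmark.realtime { doc = Nokogiri::HTML(fetched) }

puts format("fetch: %.2fs, parse: %.2fs", fetch_time, parse_time)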
Upvotes: 1