Reputation: 1246
I'd like to read a large XML file that contains over a million small bibliographic records (like <article>...</article>) using libxml in Ruby. I have tried the Reader class in combination with the expand method to read record by record, but I am not sure this is the right approach, since my code eats up memory. Hence, I'm looking for a recipe for how to conveniently process record by record with constant memory usage. Below is my main loop:
require 'libxml'
include LibXML

File.open('dblp.xml') do |io|
  dblp = XML::Reader.io(io, :options => XML::Reader::SUBST_ENTITIES)
  pubFactory = PubFactory.new
  i = 0
  while dblp.read
    case dblp.name
    when 'article', 'inproceedings', 'book'
      pub = pubFactory.create(dblp.expand)
      i += 1
      puts pub
      pub = nil
      $stderr.puts i if i % 10000 == 0
      dblp.next
    when 'proceedings', 'incollection', 'phdthesis', 'mastersthesis'
      # ignore for now
      dblp.next
    else
      # nothing
    end
  end
end
The key here is that dblp.expand reads an entire subtree (like an <article> record) and passes it as an argument to a factory for further processing. Is this the right approach?
Within the factory method I then use high-level XPath-like expressions to extract the content of elements, like below. Again, is this viable?
def first(root, node)
  x = root.find(node).first
  x ? x.content : nil
end

pub.pages = first(node, 'pages') # node contains expanded node from dblp.expand
Upvotes: 4
Views: 3959
Reputation: 1
I had the same problem, but I think I solved it by calling Node#remove! on the expanded node. In your case, I think you should do something like:

my_node = dblp.expand
# do what you have to do with my_node
dblp.next
my_node.remove!
Not really sure why this works, but if you look at the source for LibXML::XML::Reader#expand, there's a comment about freeing the node. I am guessing that Reader#expand associates the node with the Reader, and you have to call Node#remove! to free it.
Memory usage wasn't great, even with this hack, but at least it didn't keep on growing.
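Applied to the question's loop, a sketch might look like this (same variables as in the question; only the expand/next/remove! ordering is the point):

while dblp.read
  case dblp.name
  when 'article', 'inproceedings', 'book'
    node = dblp.expand            # expand ties the subtree to the reader
    pub = pubFactory.create(node)
    puts pub
    dblp.next
    node.remove!                  # detach the subtree so it can be freed
  end
end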
Upvotes: 0
Reputation: 40461
When processing big XML files, you should use a stream parser to avoid loading everything into memory. There are two common approaches: push parsers and pull parsers.
I think push parsers are nice if you want to retrieve just some fields, but they are generally messy to use for complex data extraction and are often implemented with case ... when ... constructs (see the sketch below).
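For illustration, a minimal push-parser sketch using REXML's stream API; the ArticleCounter listener class is hypothetical:

require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical listener: the parser pushes events at us as it reads.
class ArticleCounter
  include REXML::StreamListener

  attr_reader :count

  def initialize
    @count = 0
  end

  def tag_start(name, attrs)
    @count += 1 if name == 'article'
  end
end

listener = ArticleCounter.new
File.open('dblp.xml') { |io| REXML::Document.parse_stream(io, listener) }
puts listener.count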
Pull parsers are, in my opinion, a good middle ground between a tree-based model and a push parser; a minimal sketch follows. You can find a nice article about pull parsers with REXML in Dr. Dobb's Journal.
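A pull-parser sketch with REXML, assuming the same dblp.xml file from the question (here you ask the parser for the next event instead of registering callbacks):

require 'rexml/parsers/pullparser'

File.open('dblp.xml') do |io|
  parser = REXML::Parsers::PullParser.new(io)
  while parser.has_next?
    event = parser.pull
    # On start-element events, event[0] is the tag name.
    if event.start_element? && event[0] == 'article'
      # process the record here
    end
  end
end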
Upvotes: 5
Reputation: 211690
When processing XML, two common options are tree-based and event-based parsing. The tree-based approach typically reads the entire XML document into memory and can consume a large amount of it. The event-based approach uses very little additional memory, but it doesn't do anything unless you write your own handler logic.
The event-based model is employed by SAX-style parsers and derivative implementations.
Example with REXML: http://www.iro.umontreal.ca/~lapalme/ForestInsteadOfTheTrees/HTML/ch08s01.html
REXML: http://ruby-doc.org/stdlib/libdoc/rexml/rdoc/index.html
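As an illustration, a minimal SAX-style sketch using libxml-ruby (the library from the question); the ArticleHandler class is hypothetical:

require 'libxml'

# Hypothetical handler: the parser calls these methods as it streams the file.
class ArticleHandler
  include LibXML::XML::SaxParser::Callbacks

  def on_start_element(element, attributes)
    @count = (@count || 0) + 1 if element == 'article'
  end

  def on_end_document
    puts "#{@count} articles"
  end
end

parser = LibXML::XML::SaxParser.io(File.open('dblp.xml'))
parser.callbacks = ArticleHandler.new
parser.parse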
Upvotes: 1