Reputation: 145
I'm working on an application that needs to store a large 2GB+ XML file for processing, and I'm facing two problems:
Upvotes: 2
Views: 797
Reputation: 160601
Most of the time we prefer to parse an entire file that has been pulled into memory, because it's easier to jump back and forth, extracting this and that as the code needs it. Because it's in memory, we can do random access easily if we want.
For your needs, you'll want to start at the top of the file and read through it, looking for the tags of interest, until you reach the end of the file. For that, use Nokogiri::XML::SAX::Parser along with a handler based on Nokogiri::XML::SAX::Document, which receives the parsing events. Here's a summary of how it works, from Nokogiri's site:
The basic way a SAX style parser works is by creating a parser, telling the parser about the events we're interested in, then giving the parser some XML to process. The parser will notify you when it encounters events you said you would like to know about.
SAX is a different beast than dealing with the DOM, but it can be very fast, and is a lot easier on memory.
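To give a feel for it, here's a minimal sketch of that pattern. The `<title>` element and `big.xml` filename are just placeholders; swap in whatever tags and file your data actually uses:

```ruby
require 'nokogiri'

# Minimal SAX handler: collect the text of every <title> element.
class TitleHandler < Nokogiri::XML::SAX::Document
  def initialize
    @inside_title = false
    @buffer = +""
  end

  def start_element(name, attrs = [])
    if name == 'title'
      @inside_title = true
      @buffer = +""
    end
  end

  def characters(text)
    @buffer << text if @inside_title
  end

  def end_element(name)
    if name == 'title'
      @inside_title = false
      puts @buffer.strip   # process the extracted value here
    end
  end
end

parser = Nokogiri::XML::SAX::Parser.new(TitleHandler.new)
parser.parse(File.open('big.xml'))   # streamed; the file is never fully loaded
```

The handler only keeps the text of the element currently being read, so memory use stays flat no matter how large the file is.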
If you wanted to load the file in smaller chunks, you could process the XML inside an OpenURI or Net::HTTP read block, so you'd be getting it in TCP packet-sized chunks. The problem then is that your lines could be split, because TCP doesn't deliver data by lines but by blocks, which is what you'll see inside the read loop. Your code would have to peel off any partial line at the end of the buffer and prepend it to the next block it reads, so that line gets completed before you parse it.
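If you go that route, the bookkeeping looks roughly like this. It's a sketch only: the URL and the per-line handling are placeholders, and the point is carrying the partial line from one chunk into the next:

```ruby
require 'net/http'
require 'uri'

uri = URI('http://example.com/huge.xml')   # hypothetical source
leftover = +""

Net::HTTP.start(uri.host, uri.port) do |http|
  http.request_get(uri.path) do |response|
    response.read_body do |chunk|           # body arrives in network-sized blocks
      buffer = leftover + chunk
      lines = buffer.split("\n", -1)
      leftover = lines.pop                  # last piece may be an incomplete line
      lines.each do |line|
        # handle one complete line of XML here
      end
    end
  end
end

# whatever is left after the loop is the final, unterminated line
puts leftover unless leftover.empty?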
Upvotes: 1
Reputation: 11086
You'll need a streaming parser. Have a look at https://github.com/craigambrose/sax_stream
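As a rough idea of its mapper-based style (adapted from the gem's README; the Product class and products.xml are illustrative, and the exact API may differ in the version you install):

```ruby
require 'sax_stream/mapper'
require 'sax_stream/parser'
require 'sax_stream/collectors/naive_collector'

# Maps each <product> node in the stream onto a plain Ruby object.
class Product
  include SaxStream::Mapper

  node 'product'
  map :id,   :to => '@id'
  map :name, :to => 'name'
end

collector = SaxStream::Collectors::NaiveCollector.new
parser    = SaxStream::Parser.new(collector, [Product])
parser.parse_stream(File.open('products.xml'))  # reads the file incrementally
# collector now holds the mapped Product objects
```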
You could run your own FTP server on EC2? Or use a hosted provider such as https://hostedftp.com/
Upvotes: 0