Jeff Locke

Reputation: 627

Parse large XML file into database. Use multiple threads?

I have a 5GB+ XML file that I want to parse into a MySQL database. I currently have a Ruby script that uses a Nokogiri SAX parser and inserts each book into the database as it is parsed, but this is very slow because the rows are inserted one at a time. I need to figure out a way to process the large file with multiple concurrent threads.

I was thinking I could split the file into several smaller files and have a separate script work on each one. Alternatively, the script could send each item to a background job that inserts it into the database, perhaps using delayed_job, resque or sidekiq.

<?xml version="1.0"?>
<library>
  <NAME>cool name</NAME>
  <book ISBN="11342343">
    <title>To Kill A Mockingbird</title>
    <description>book desc</description>
    <author>Harper Lee</author>
  </book>
  <book ISBN="989894781234">
    <title>Catcher in the Rye</title>
    <description>another description</description>
    <author>J. D. Salinger</author>
  </book>
</library>

Does anyone have experience with this? With the current script, it'll take a year to load the database.

Upvotes: 3

Views: 995

Answers (1)

Jon Skeet

Reputation: 1502825

This sounds like the perfect job for a producer/consumer queue. You only want one thread parsing the XML - but as it parses items (presumably converting them into some object type ready for insertion) it can put the converted objects onto a queue that multiple threads are reading from. Each consumer thread would just block on the queue until either the queue is "done" (i.e. the producer says there won't be any more data) or there's an item in the queue - in which case it processes it (adding the item to the database) and then goes back to waiting for data. You'll want to experiment with how many consumer threads give you the maximum throughput - it will depend on various considerations, mostly around how your database is configured and what your connection to it is like.

I don't know anything about threading in Ruby so I can't give you sample code, but I'm sure there must be a good standard producer/consumer queue available, and the rest should be reasonably straightforward.
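
For illustration only, here is a minimal sketch of that pattern using `SizedQueue` from Ruby's core library. The method name `run_pipeline`, its parameters, and the `insert_batch` block are hypothetical and not taken from the question's script. The sketch also assumes that grouping rows into batches is acceptable, which goes slightly beyond the plain queue described above, and it uses an in-memory list in place of the real SAX parser.

```ruby
# Hypothetical sketch of a producer/consumer pipeline.
# `books` stands in for whatever the single parsing thread produces;
# `insert_batch` stands in for the real database insert.
def run_pipeline(books, consumer_count: 4, batch_size: 100, queue_size: 1_000, &insert_batch)
  # A bounded queue makes the producer block when the consumers fall behind,
  # so parsed items do not pile up in memory.
  queue = SizedQueue.new(queue_size)

  consumers = Array.new(consumer_count) do
    Thread.new do
      batch = []
      # pop blocks until an item is available and returns nil once the
      # queue has been closed and fully drained.
      while (book = queue.pop)
        batch << book
        if batch.size >= batch_size
          insert_batch.call(batch)
          batch = []
        end
      end
      # Flush whatever is left when the producer is done.
      insert_batch.call(batch) unless batch.empty?
    end
  end

  # Producer: the only place that pushes onto the queue.
  books.each { |book| queue << book }
  queue.close # tells the consumers there will be no more data

  consumers.each(&:join)
end

# Demo with a stand-in "insert" that only counts rows.
count = 0
lock = Mutex.new
sample = (1..250).map { |i| { isbn: i.to_s, title: "Book #{i}" } }
run_pipeline(sample, consumer_count: 3, batch_size: 100) do |batch|
  lock.synchronize { count += batch.size }
end
puts "inserted #{count} rows"
```

In a real script the `queue << book` call would presumably live in the SAX handler's end-of-element callback for `book`, and each consumer thread would likely need its own database connection rather than sharing one. The consumer count, batch size and queue size shown here are arbitrary starting points to tune.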

Upvotes: 1
