Christoffer

Reputation: 2411

Rails Parsing a large XML with Nokogiri::XML::Reader => Model.create

I have many large (32 MB) XML files with product information from different stores. I am using Rails, hosted on Heroku.

I want to parse these XML feeds and write the products into my database. I have a semi-working solution, but it is very slow and too memory-intensive.

Up until now I have been using more or less this:

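require 'nokogiri'
require 'open-uri'

# Downloads the whole feed and builds a full DOM tree in memory
open_uri_fetched = URI.open(xml_from_url)
xml_list = Nokogiri::XML(open_uri_fetched)
xml_list.xpath("//product").each do |product|
  # parse the child nodes of product
  # Model.create()
end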

This works to some extent, but it causes memory problems on Heroku that crash the script. It is also very slow (I do this for 200+ feeds).

Heroku told me to fix the problem by using Nokogiri::XML::Reader which is what I am trying to do now.

I have also looked into using:

ActiveRecord::Base.transaction do
  Model.create()
end

to speed up the Model.create() calls.

Question 1: Is this the right way (or at least a decent way) to approach my problem?

Now, this is what I am trying to do:

  reader = Nokogiri::XML::Reader(File.open('this_feed.xml'))
  reader.each do |node|
    if node.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT
      if node.name.downcase == xname
        puts "Name: " + node.inner_xml
        use_name = node.inner_xml
      end
    end
  end

Question 2: Where do I put the Model.create code?

ActiveRecord::Base.transaction do
  Model.create(:name => use_name)
end

If I put it in the loop, it will try to create a record for each node, which is wrong. I want it to be called once after each product in the XML, right?

If I build up a Hash while reading the XML (and then use it for the Model.create calls), won't that be just as memory-intensive as loading the whole XML file in the first place?

The XML-file looks, in short, like this:

<?xml version="1.0" encoding="UTF-8" ?>
<products>
    <product>
        <name>This cool product</name>
        <categories>
            <category>Food</category>
            <category>Drinks</category>
        </categories>
        <SKU />
        <EAN />
        <description>A long description...</description>
        <model />
        <brand />
        <gender />
        <price>126.00</price>
        <regularPrice>126.00</regularPrice>
        <shippingPrice />
        <currency>SEK</currency>
        <productUrl>http://www.domain.com/1.html</productUrl>
        <graphicUrl>http://www.domain.com/1.jpg</graphicUrl>
        <inStock />
        <inStockQty />
        <deliveryTime />
    </product>
</products>

Upvotes: 1

Views: 1593

Answers (1)

Mark Thomas

Reputation: 37517

Reader simply scans the document a single time. You have to keep track of state yourself: which elements you've seen, whether you're inside elements you care about, etc.

This gist is a little-known beauty that vastly improves Reader syntax. It keeps track of state for you, in a very easy to read fashion.

Here's an example of how to use it, taken from the comments:

Xml::Parser.new(Nokogiri::XML::Reader(open(file))) do
  inside_element 'User' do
    for_element 'Name' do puts "Username: #{inner_xml}" end
    for_element 'Email' do puts "Email: #{inner_xml}" end

    for_element 'Address' do
      puts 'Start of address:'
      inside_element do
        for_element 'Street' do puts "Street: #{inner_xml}" end
        for_element 'Zipcode' do puts "Zipcode: #{inner_xml}" end
        for_element 'City' do puts "City: #{inner_xml}" end
      end
      puts 'End of address'
    end
  end
end

Someone should really make a gem out of this little, um, gem.

In your case, you can have an inside_element 'product' block, extract the elements you need, and create your model instance at the end of your product block.

Upvotes: 3
