Reputation: 2411
I have plenty of large (32 MB) XML files with product information from different stores. I am using Rails, hosted on Heroku.
I want to parse these XML feeds and write the products into my database. I have a semi-working solution, but it is very slow and uses too much memory.
I have up until now been using more or less this:
open_uri_fetched = open(xml_from_url)
xml_list = Nokogiri::HTML(open_uri_fetched)
xml_list.xpath("//product").each do |product|
  # parse the child nodes of this product
  # Model.create()
end
This works to some extent, but it causes memory problems on Heroku that crash the script. It is also VERY slow (I do this for 200+ feeds).
Heroku told me to fix the problem by using Nokogiri::XML::Reader which is what I am trying to do now.
I have also looked into using:
ActiveRecord::Base.transaction do
  Model.create()
end
to speed up the Model.create() calls.
Question 1: Is this the right way (or at least a decent way) to approach my problem?
Now, this is what I am trying to do:
reader = Nokogiri::XML::Reader(File.open('this_feed.xml'))
reader.each do |node|
  if node.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT
    if node.name.downcase == xname
      puts "Name: " + node.inner_xml
      use_name = node.inner_xml
    end
  end
end
Question 2: Where do I put the Model.create code?
ActiveRecord::Base.transaction do
  Model.create(:name => use_name)
end
If I put it in the loop, it will try to create a record for each node, which is wrong. I want it to be called once after each product in the XML list, right?
If I build up a Hash while reading the XML (and then use it for the Model.create calls), will that not be just as memory-intensive as opening the whole XML file in the first place?
The XML-file looks, in short, like this:
<?xml version="1.0" encoding="UTF-8" ?>
<products>
  <product>
    <name>This cool product</name>
    <categories>
      <category>Food</category>
      <category>Drinks</category>
    </categories>
    <SKU />
    <EAN />
    <description>A long description...</description>
    <model />
    <brand />
    <gender />
    <price>126.00</price>
    <regularPrice>126.00</regularPrice>
    <shippingPrice />
    <currency>SEK</currency>
    <productUrl>http://www.domain.com/1.html</productUrl>
    <graphicUrl>http://www.domain.com/1.jpg</graphicUrl>
    <inStock />
    <inStockQty />
    <deliveryTime />
  </product>
</products>
Upvotes: 1
Views: 1593
Reputation: 37517
Reader simply scans the document a single time. You have to keep track of state yourself: which elements you've seen, whether you're inside elements you care about, etc.
This gist is a little-known beauty that vastly improves Reader syntax. It keeps track of state for you, in a very easy to read fashion.
Here's an example of how to use it, taken from the comments:
Xml::Parser.new(Nokogiri::XML::Reader(open(file))) do
  inside_element 'User' do
    for_element 'Name' do puts "Username: #{inner_xml}" end
    for_element 'Email' do puts "Email: #{inner_xml}" end
    for_element 'Address' do
      puts 'Start of address:'
      inside_element do
        for_element 'Street' do puts "Street: #{inner_xml}" end
        for_element 'Zipcode' do puts "Zipcode: #{inner_xml}" end
        for_element 'City' do puts "City: #{inner_xml}" end
      end
      puts 'End of address'
    end
  end
end
Someone should really make a gem out of this little, um, gem.
In your case, you can have an inside_element 'product' block, extract the elements you need, and create your model instance at the end of your product block.
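On the memory worry about building up a Hash: you only need to hold a bounded number of products at once. Here is a rough sketch of a small buffer that flushes every N records. BatchWriter is a hypothetical class of my own, and the block passed to it stands in for the database write. In the Rails app that block would presumably wrap the Model.create calls in one ActiveRecord::Base.transaction per batch; here it only records batch sizes so the sketch runs without a database.

```ruby
# Hypothetical helper: buffers records and hands them to a block in fixed-size batches.
class BatchWriter
  def initialize(batch_size, &flusher)
    @batch_size = batch_size
    @flusher = flusher
    @buffer = []
  end

  # Adds one record and flushes automatically once the buffer is full.
  def <<(record)
    @buffer << record
    flush if @buffer.size >= @batch_size
    self
  end

  # Hands any buffered records to the block and starts a fresh buffer.
  def flush
    return if @buffer.empty?

    @flusher.call(@buffer)
    @buffer = []
  end
end

# In Rails, the block body might look like this (an assumption, not tested here):
#   ActiveRecord::Base.transaction { batch.each { |attrs| Model.create(attrs) } }
batch_sizes = []
writer = BatchWriter.new(2) { |batch| batch_sizes << batch.size }
5.times { |i| writer << { name: "Product #{i}" } }
writer.flush # do not forget the final, partially filled batch
puts batch_sizes.inspect
```

At the end of each product block you would push the product's attributes into the writer, so at most one batch of products is in memory at any time.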
Upvotes: 3