Reputation: 760
I have a Rake task in my Rails app which looks into a folder for an XML file, parses it, and saves it to a database. The code works OK, but I have about 2100 files totaling 1.5GB, and processing is very slow, about 400 files in 7 hours. There are approximately 600-650 contracts in each XML file, and each contract can have 0 to n attachments. I did not paste all values, but each contract has 25 values.
To speed-up the process I use Activerecord's Import gem, so I am building an array per file and when the whole file is parsed. I do a mass import to Postgres. Only if a record is found is it directly updated and/or a new attachment inserted, but this is like 1 out of 100000 records. This helps a little, instead of doing new record per contract, but now I see that the slow part is XML parsing. Can you please look if I am doing something wrong in my parsing?
When I tried to print the arrays I am building, the slow part was until it loaded/parsed whole file and starts printing array by array. Thats why I assume the probem with speed is in parsing as Nokogiri loads the whole XML before it starts.
require 'nokogiri'
require 'pp'
require "activerecord-import/base"
ActiveRecord::Import.require_adapter('postgresql')
namespace :loadcrz2 do
desc "this task load contracts from crz xml files to DB"
task contracts: :environment do
actual_dir = File.dirname(__FILE__).to_s
Dir.foreach(actual_dir+'/../../crzfiles') do |xmlfile|
next if xmlfile == '.' or xmlfile == '..' or xmlfile == 'archive'
page = Nokogiri::XML(open(actual_dir+"/../../crzfiles/"+xmlfile))
puts xmlfile
cons = page.xpath('//contracts/*')
contractsarr = []
@c =[]
cons.each do |contract|
name = contract.xpath("name").text
crzid = contract.xpath("ID").text
procname = contract.xpath("procname").text
conname = contract.xpath("contractorname").text
subject = contract.xpath("subject").text
dateeff = contract.xpath("dateefficient").text
valuecontract = contract.xpath("value").text
attachments = contract.xpath('attachments/*')
attacharray = []
attachments.each do |attachment|
attachid = attachment.xpath("ID").text
attachname = attachment.xpath("name").text
doc = attachment.xpath("document").text
size = attachment.xpath("size").text
arr = [attachid,attachname,doc,size]
attacharray.push arr
end
@con = Crzcontract.find_by_crzid(crzid)
if @con.nil?
@c=Crzcontract.new(:crzname => name,:crzid => crzid,:crzprocname=>procname,:crzconname=>conname,:crzsubject=>subject,:dateeff=>dateeff,:valuecontract=>valuecontract)
else
@con.crzname = name
@con.crzid = crzid
@con.crzprocname=procname
@con.crzconname=conname
@con.crzsubject=subject
@con.dateeff=dateeff
@con.valuecontract=valuecontract
@con.save!
end
attacharray.each do |attar|
attachid=attar[0]
attachname=attar[1]
doc=attar[2]
size=attar[3]
@at = Crzattachment.find_by_attachid(attachid)
if @at.nil?
if @con.nil?
@c.crzattachments.build(:attachid=>attachid,:attachname=>attachname,:doc=>doc,:size=>size)
else
@a=Crzattachment.new
@a.attachid = attachid
@a.attachname = attachname
@a.doc = doc
@a.size = size
@[email protected]
@a.save!
end
end
end
if @c.present?
contractsarr.push @c
end
#p @c
end
#p contractsarr
puts "done"
if contractsarr.present?
Crzcontract.import contractsarr, recursive: true
end
FileUtils.mv(actual_dir+"/../../crzfiles/"+xmlfile, actual_dir+"/../../crzfiles/archive/"+xmlfile)
end
end
end
Upvotes: 2
Views: 1815
Reputation: 160601
There are a number of problems with the code. Here are some ways to improve it:
actual_dir = File.dirname(__FILE__).to_s
Don't use to_s
. dirname
is already returning a string.
actual_dir+'/../../crzfiles'
, with and without a trailing path delimiter is used repeatedly. Don't make Ruby rebuild the concatenated string over and over. Instead define it once, but take advantage of Ruby's ability to build the full path:
File.absolute_path('../../bar', '/path/to/foo') # => "/path/bar"
So use:
actual_dir = File.absolute_path('../../crzfiles', __FILE__)
and then refer to actual_dir
only:
Dir.foreach(actual_dir)
This is unwieldy:
next if xmlfile == '.' or xmlfile == '..' or xmlfile == 'archive'
I'd do:
next if (xmlfile[0] == '.' || xmlfile == 'archive')
or even:
next if xmlfile[/^(?:\.|archive)/]
Compare these:
'.hidden'[/^(?:\.|archive)/] # => "."
'.'[/^(?:\.|archive)/] # => "."
'..'[/^(?:\.|archive)/] # => "."
'archive'[/^(?:\.|archive)/] # => "archive"
'notarchive'[/^(?:\.|archive)/] # => nil
'foo.xml'[/^(?:\.|archive)/] # => nil
The pattern will return a truthy value if it starts with '.'
or is equal to 'archive'
. It's not as readable but it's compact. I'd recommend the compound conditional test though.
In some places, you're concatenating xmlfile
, so again let Ruby do it once:
xml_filepath = File.join(actual_dir, xmlfile)
which will honor the file path delimiter for whatever OS you're running on. Then use xml_filepath
instead of concatenating the name:
xml_filepath = File.join(actual_dir, xmlfile)))
page = Nokogiri::XML(open(xml_filepath))
[...]
FileUtils.mv(xml_filepath, File.join(actual_dir, "archive", xmlfile)
join
is a good tool so take advantage of it. It's not just another name for concatenating strings, because it's also aware of the correct delimiter to use for the OS the code is running on.
You use a lot of instances of:
xpath("some_selector").text
Don't do that. xpath
, along with css
and search
return a NodeSet, and text
when used on a NodeSet can be evil in a way that'll hurtle you down a very steep and slippery slope. Consider this:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<root>
<node>
<data>foo</data>
</node>
<node>
<data>bar</data>
</node>
</root>
EOT
doc.search('//node/data').class # => Nokogiri::XML::NodeSet
doc.search('//node/data').text # => "foobar"
The concatenation of the text into 'foobar' can't be split easily and it's a problem we see here in questions too often.
Do this if you expect getting a NodeSet back because of using search
, xpath
or css
:
doc.search('//node/data').map(&:text) # => ["foo", "bar"]
It's better to use at
, at_xpath
or at_css
if you're after a specific node because then text
will work as you'd expect.
See "How to avoid joining all text from Nodes when scraping" also.
There's a lot of replication that could be DRY'd. Instead of this:
name = contract.xpath("name").text
crzid = contract.xpath("ID").text
procname = contract.xpath("procname").text
You could do something like:
name, crzid, procname = [
'name', 'ID', 'procname'
].map { |s| contract.at(s).text }
Upvotes: 2