Reputation: 5882
I have been trying to figure out what's going on with my Rails app as it relates to memory and Nokogiri's XML parsing. For some reason, this one function alone consumes about 1GB of memory and does not release it when it completes. I'm not quite sure what's going on here.
def split_nessus
  # To avoid consuming too much memory, split the Nessus file into
  # multiple smaller files if it's over 10MB.
  file_size = File.size(@nessus_files[0]).to_f / 2**20
  files = []
  if file_size >= 10
    file_num = 1
    d = File.open(@nessus_files[0])
    content = Nokogiri::XML(d.read)
    d.close
    data = Nokogiri::XML("<data></data>")
    hosts_num = 0
    content.xpath("//ReportHost").each do |report_host|
      data.root << report_host
      hosts_num += 1
      # Flush a chunk file every 100 hosts.
      if hosts_num == 100
        File.open("#{@nessus_files[0]}_nxtmp_#{file_num}", "w") { |f| f.write(data.to_xml) }
        files << "#{@nessus_files[0]}_nxtmp_#{file_num}"
        data = Nokogiri::XML("<data></data>")
        hosts_num = 0
        file_num += 1
      end
    end
    # Write out any hosts left over after the last full chunk.
    if hosts_num > 0
      File.open("#{@nessus_files[0]}_nxtmp_#{file_num}", "w") { |f| f.write(data.to_xml) }
      files << "#{@nessus_files[0]}_nxtmp_#{file_num}"
    end
    @nessus_files = files
  end
end
Since Rails crashes when trying to parse a 100MB+ XML file, I've decided to break XML files into separate files if they're over 10MB and handle them individually.
Any thoughts as to why this won't release the roughly 1GB of memory once it completes?
Upvotes: 0
Views: 770
Reputation: 18845
Nokogiri uses system libraries like libxml and libxslt under the hood. Because of that, I would assume the problem is probably not in Ruby's garbage collection but somewhere else.
If you are working with large files, it's usually a good idea to stream-process them so that you don't load the whole file into memory at once; reading a 100MB+ file into a single Ruby string is itself a large part of that memory consumption.
Because of this, when working with large XML files, you should use a streaming parser. In Nokogiri this is Nokogiri::XML::SAX.
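As a rough sketch of the event-driven approach (the file name scan.nessus and the counting handler are just placeholders, not your actual splitting logic):

require 'nokogiri'

# Minimal SAX handler: reacts to elements as they stream by instead of
# building a full document tree in memory.
class ReportHostHandler < Nokogiri::XML::SAX::Document
  attr_reader :host_count

  def initialize
    @host_count = 0
  end

  # Called once per opening tag; attrs is an array of [name, value] pairs.
  def start_element(name, attrs = [])
    @host_count += 1 if name == "ReportHost"
  end
end

handler = ReportHostHandler.new
Nokogiri::XML::SAX::Parser.new(handler).parse(File.open("scan.nessus"))
puts handler.host_count

The parser never holds the whole document tree in memory; the only state that accumulates is whatever your handler keeps.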
Upvotes: 3