LewlSauce

Reputation: 5882

Memory consumption when using Nokogiri and XML

I have been trying to figure out what's going on with my Rails app's memory usage during Nokogiri XML parsing. For some reason, this one function alone consumes about 1GB of memory and does not release it when it completes. I'm not quite sure what's going on here.

def split_nessus
    # To avoid consuming too much memory, we're going to split the Nessus file
    # into multiple smaller files if it's over 10MB.
    file_size = File.size(@nessus_files[0]).to_f / 2**20
    files = []

    if file_size >= 10
        file_num = 1
        content = File.open(@nessus_files[0]) { |d| Nokogiri::XML(d.read) }
        data = Nokogiri::XML("<data></data>")
        hosts_num = 0

        content.xpath("//ReportHost").each do |report_host|
            data.root << report_host
            hosts_num += 1

            if hosts_num == 100
                File.open("#{@nessus_files[0]}_nxtmp_#{file_num}", "w") {|f| f.write(data.to_xml)}
                files << "#{@nessus_files[0]}_nxtmp_#{file_num}"
                data = Nokogiri::XML("<data></data>")
                hosts_num = 0
                file_num += 1
            end
        end

        # Write out any remaining hosts that didn't fill a full chunk of 100.
        if hosts_num > 0
            File.open("#{@nessus_files[0]}_nxtmp_#{file_num}", "w") {|f| f.write(data.to_xml)}
            files << "#{@nessus_files[0]}_nxtmp_#{file_num}"
        end

        @nessus_files = files
    end
end

Since Rails crashes when trying to parse a 100MB+ XML file, I've decided to break XML files over 10MB into separate smaller files and handle them individually.

Any thoughts as to why this doesn't release roughly 1GB of memory when it completes?

Upvotes: 0

Views: 770

Answers (1)

phoet

Reputation: 18845

Nokogiri uses system libraries like libxml2 and libxslt under the hood. Because of that, I would assume the problem is probably not in Ruby's garbage collection but in memory held by those native libraries.

If you are working with large files, it's usually a good idea to stream-process them so that you do not load the whole file into memory. Holding the entire document as one large string is itself a huge source of memory consumption, and building a full DOM tree on top of that typically takes several times more memory than the raw file.

Because of this, when working with large XML files, you should use a stream parser. In Nokogiri this is Nokogiri::XML::SAX.

Upvotes: 3

Related Questions