user3206440

Reputation: 5059

Nokogiri::XML::Reader - processing large XML files and skipping nodes of no interest

I have some XML in the format shown below, which I'm trying to parse with Nokogiri::XML::Reader because the file is huge (~1 GB). The file contains many packets in this format.

From each packet I need to gather frame.time_epoch and s1ap.procedureCode.

I'm currently doing the following.

require 'nokogiri'

data = []
file = 'some_file.xml'
reader = Nokogiri::XML::Reader(File.open(file))
reader.each do |node|
  if node.name == 'packet' && node.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT
    doc = Nokogiri::XML(node.outer_xml)
    # css returns a NodeSet, which is always truthy, so test the first match instead
    procedure_field = doc.css("field[name='s1ap.procedureCode']").first
    next if procedure_field.nil? # do nothing if the <packet> is not of s1ap type
    epochTime = doc.css("field[name='frame.time_epoch']").first["show"].to_i
    procedureCode = procedure_field["show"].to_i
    data << { epochTime: epochTime, procedureCode: procedureCode }
  end
end

Issue

The problem is that the parsing is really slow. I notice that the reader visits every node inside each <packet> ... </packet>. Is there a way to make the reader jump straight to the next node named packet instead of walking through every line within the current packet?

XML format

<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="pdml2html.xsl"?>
<packet>
  <proto name="geninfo" pos="0" showname="General information" size="126">
    <field name="num" pos="0" show="6" showname="Number" value="6" size="126"/>
  </proto>
  <proto name="frame" showname="Frame 6: 126 bytes on wire (1008 bits), 126 bytes captured (1008 bits) on interface 0" size="126" pos="0">
    <field name="frame.encap_type" showname="Encapsulation type: Ethernet (1)" size="0" pos="0" show="1"/>
    <field name="frame.time_epoch" showname="Epoch Time: 1474267259.184197000 seconds" size="0" pos="0" show="1474267259.184197000"/>
  </proto>
  <proto name="eth" showname="Ethernet II, Src: JuniperN_e6:a6:cc (40:b4:f0:e6:a6:cc), Dst: HewlettP_89:a5:91 (ac:16:2d:89:a5:91)" size="14" pos="0">
    <field name="eth.dst" showname="Destination: HewlettP_89:a5:91 (ac:16:2d:89:a5:91)" size="6" pos="0" show="ac:16:2d:89:a5:91" value="ac162d89a591">
      <field name="eth.dst_resolved" showname="Destination (resolved): HewlettP_89:a5:91" hide="yes" size="6" pos="0" show="HewlettP_89:a5:91" value="ac162d89a591"/>
    </field>
    <field name="eth.type" showname="Type: IPv4 (0x0800)" size="2" pos="12" show="0x00000800" value="0800"/>
  </proto>  
  <proto name="s1ap" showname="S1 Application Protocol" size="45" pos="78">
    <field name="per.choice_index" showname="Choice Index: 0" hide="yes" size="1" pos="78" show="0" value="00"/>
    <field name="s1ap.S1AP_PDU" showname="S1AP-PDU: initiatingMessage (0)" size="45" pos="78" show="0" value="000b402900000300000005c007c03ae900080003403b53001a0012113743f99f9500075d010605f070c04070c1">
      <field name="s1ap.initiatingMessage_element" showname="initiatingMessage" size="45" pos="78" show="" value="">
        <field name="s1ap.procedureCode" showname="procedureCode: id-downlinkNASTransport (11)" size="1" pos="79" show="11" value="0b"/>
       </field>
    </field>
  </proto>
</packet>
<packet>
  <proto name="geninfo" pos="0" showname="General information" size="126">
    <field name="num" pos="0" show="6" showname="Number" value="6" size="126"/>
  </proto>
  <proto name="frame" showname="Frame 6: 126 bytes on wire (1008 bits), 126 bytes captured (1008 bits) on interface 0" size="126" pos="0">
    <field name="frame.encap_type" showname="Encapsulation type: Ethernet (1)" size="0" pos="0" show="1"/>
    <field name="frame.time_epoch" showname="Epoch Time: 1474267260.184197000 seconds" size="0" pos="0" show="1474267259.184197000"/>
  </proto>
  <proto name="eth" showname="Ethernet II, Src: JuniperN_e6:a6:cc (40:b4:f0:e6:a6:cc), Dst: HewlettP_89:a5:91 (ac:16:2d:89:a5:91)" size="14" pos="0">
    <field name="eth.dst" showname="Destination: HewlettP_89:a5:91 (ac:16:2d:89:a5:91)" size="6" pos="0" show="ac:16:2d:89:a5:91" value="ac162d89a591">
      <field name="eth.dst_resolved" showname="Destination (resolved): HewlettP_89:a5:91" hide="yes" size="6" pos="0" show="HewlettP_89:a5:91" value="ac162d89a591"/>
    </field>
    <field name="eth.type" showname="Type: IPv4 (0x0800)" size="2" pos="12" show="0x00000800" value="0800"/>
  </proto>  
  <proto name="s1ap" showname="Some other protocol" size="45" pos="78">
    <field name="per.choice_index" showname="Choice Index: 0" hide="yes" size="1" pos="78" show="0" value="00"/>
    <field name="other.OTH_PDU" showname="S1AP-PDU: initiatingMessage (0)" size="45" pos="78" show="0" value="000b402900000300000005c007c03ae900080003403b53001a0012113743f99f9500075d010605f070c04070c1">
      <field name="other.initiatingMessage_element" showname="initiatingMessage" size="45" pos="78" show="" value="">
        <field name="other.procedureCode" showname="procedureCode: id-someTransport (99)" size="1" pos="79" show="11" value="0b"/>
       </field>
    </field>
  </proto>
</packet>
<packet>
  <proto name="geninfo" pos="0" showname="General information" size="126">
    <field name="num" pos="0" show="6" showname="Number" value="6" size="126"/>
  </proto>
  <proto name="frame" showname="Frame 6: 126 bytes on wire (1008 bits), 126 bytes captured (1008 bits) on interface 0" size="126" pos="0">
    <field name="frame.encap_type" showname="Encapsulation type: Ethernet (1)" size="0" pos="0" show="1"/>
    <field name="frame.time_epoch" showname="Epoch Time: 1474267261.184197000 seconds" size="0" pos="0" show="1474267259.184197000"/>
  </proto>
  <proto name="eth" showname="Ethernet II, Src: JuniperN_e6:a6:cc (40:b4:f0:e6:a6:cc), Dst: HewlettP_89:a5:91 (ac:16:2d:89:a5:91)" size="14" pos="0">
    <field name="eth.dst" showname="Destination: HewlettP_89:a5:91 (ac:16:2d:89:a5:91)" size="6" pos="0" show="ac:16:2d:89:a5:91" value="ac162d89a591">
      <field name="eth.dst_resolved" showname="Destination (resolved): HewlettP_89:a5:91" hide="yes" size="6" pos="0" show="HewlettP_89:a5:91" value="ac162d89a591"/>
    </field>
    <field name="eth.type" showname="Type: IPv4 (0x0800)" size="2" pos="12" show="0x00000800" value="0800"/>
  </proto>  
  <proto name="s1ap" showname="S1 Application Protocol" size="45" pos="78">
    <field name="per.choice_index" showname="Choice Index: 0" hide="yes" size="1" pos="78" show="0" value="00"/>
    <field name="s1ap.S1AP_PDU" showname="S1AP-PDU: initiatingMessage (0)" size="45" pos="78" show="0" value="000b402900000300000005c007c03ae900080003403b53001a0012113743f99f9500075d010605f070c04070c1">
      <field name="s1ap.initiatingMessage_element" showname="initiatingMessage" size="45" pos="78" show="" value="">
        <field name="s1ap.procedureCode" showname="procedureCode: id-uplinkTransport (13)" size="1" pos="79" show="13" value="0b"/>
       </field>
    </field>
  </proto>
</packet>
<!-- more <packet>s here -->

Upvotes: 3

Views: 1006

Answers (2)

s1mpl3

Reputation: 1464

For such a huge document you should use the SAX parser:

http://www.rubydoc.info/github/sparklemotion/nokogiri/Nokogiri/XML/SAX

Stream processing a large document is preferable to pulling the whole thing into memory and parsing it as a DOM, especially since the problem requires only a single pass.

Here is code that accomplishes the task by streaming the XML with SAX:

require 'nokogiri'

class PacketFilter < Nokogiri::XML::SAX::Document
  def initialize
    reset
  end

  def end_document
    puts 'the document has ended'
  end

  def start_element(name, attributes = [])
    case name
    when 'packet'
      @in_packet = true
    when 'proto'
      @have_s1ap ||= @in_packet && attribute_value(attributes, 'name') == 's1ap' # stay true once an s1ap proto is seen, even if other protos follow
    when 'field'
      case attribute_value(attributes, 'name')
      when 's1ap.procedureCode'
        @procedure_code = attribute_value(attributes, 'showname')
      when 'frame.time_epoch'
        @epoch_time = attribute_value(attributes, 'showname')
      end
    end
  end

  def end_element(name)
    if name == 'packet'
      puts "#{@procedure_code}, #{@epoch_time}" if @have_s1ap
      reset
    end
  end

  private

  def attribute_value(attributes, name)
    attributes.reduce(nil) do |value, assoc|
      assoc[0] == name ? assoc[1] : value
    end
  end

  def reset
    @in_packet = false
    @have_s1ap = false
    @procedure_code = nil
    @epoch_time = nil
  end
end

parser = Nokogiri::XML::SAX::Parser.new(PacketFilter.new)
parser.parse($stdin)

If you paste your data sample into data.xml and the above Ruby into poke.rb:

$ cat data.xml | ruby poke.rb
procedureCode: id-downlinkNASTransport (11), Epoch Time: 1474267259.184197000 seconds
the document has ended
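
The attribute_value helper above walks the whole attribute list with reduce. Since the code treats the attributes as an array of [name, value] pairs, a shorter variant can lean on Ruby's Array#assoc. This is only a sketch of an alternative: it returns the first match rather than the last, which should be equivalent here on the assumption that an element never repeats an attribute name (well-formed XML does not allow that).

```ruby
# Sketch of an alternative to the reduce-based helper.
# `attributes` is assumed to be an array of [name, value] pairs,
# as passed to start_element in the code above.
def attribute_value(attributes, name)
  pair = attributes.assoc(name) # first pair whose first element == name, or nil
  pair && pair[1]
end

attrs = [['name', 'frame.time_epoch'], ['show', '1474267259.184197000']]
p attribute_value(attrs, 'show')    # prints "1474267259.184197000"
p attribute_value(attrs, 'missing') # prints nil
```

A missing attribute yields nil in both versions, so the rest of the filter does not need to change.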

Upvotes: 3

brainbag

Reputation: 1057

Instead of looping through every node, you could loop through only the packet elements and skip any that don't match your criteria. This visits only the packet elements rather than every element, which should be significantly faster.

require 'nokogiri'

data = []
file = 'some_file.xml'
doc = Nokogiri::XML.fragment(File.read(file)) # use `read` instead of `open`
doc.xpath('packet').each do |packet|
  # css returns a NodeSet, which is always truthy, so test the first match instead
  procedure_field = packet.css("field[name='s1ap.procedureCode']").first
  next if procedure_field.nil? # do nothing if the <packet> is not of s1ap type
  epochTime = packet.css("field[name='frame.time_epoch']").first["show"].to_i
  procedureCode = procedure_field["show"].to_i
  data << { epochTime: epochTime, procedureCode: procedureCode }
end

» data
=> [{:epochTime=>1474267259, :procedureCode=>11}

Upvotes: 0
