Reputation: 5059
I have some XML in the format shown below, which I'm trying to parse using Nokogiri::XML::Reader because the file is huge (~1GB). The file contains many packets in this format.
From each packet I need to gather frame.time_epoch and s1ap.procedureCode.
I'm currently doing the following.
require 'nokogiri'

data = []
file = 'some_file.xml' # plain string; backticks would run a shell command
reader = Nokogiri::XML::Reader(File.open(file))
reader.each do |node|
  next unless node.name == 'packet' && node.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT

  doc = Nokogiri::XML(node.outer_xml)
  procedure_field = doc.at_css("field[name='s1ap.procedureCode']")
  next unless procedure_field # do nothing if the <packet> is not of s1ap type

  epochTime = doc.at_css("field[name='frame.time_epoch']")["show"].to_i
  procedureCode = procedure_field["show"].to_i
  data << { epochTime: epochTime, procedureCode: procedureCode }
end
Issue
The challenge I'm facing is that the parsing is really slow. One thing I notice is that the reader visits every subsequent node within a <packet> </packet>. Is there a way to have the reader move straight to the next node named packet, rather than going through each line within the current packet?
XML format
<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="pdml2html.xsl"?>
<packet>
<proto name="geninfo" pos="0" showname="General information" size="126">
<field name="num" pos="0" show="6" showname="Number" value="6" size="126"/>
</proto>
<proto name="frame" showname="Frame 6: 126 bytes on wire (1008 bits), 126 bytes captured (1008 bits) on interface 0" size="126" pos="0">
<field name="frame.encap_type" showname="Encapsulation type: Ethernet (1)" size="0" pos="0" show="1"/>
<field name="frame.time_epoch" showname="Epoch Time: 1474267259.184197000 seconds" size="0" pos="0" show="1474267259.184197000"/>
</proto>
<proto name="eth" showname="Ethernet II, Src: JuniperN_e6:a6:cc (40:b4:f0:e6:a6:cc), Dst: HewlettP_89:a5:91 (ac:16:2d:89:a5:91)" size="14" pos="0">
<field name="eth.dst" showname="Destination: HewlettP_89:a5:91 (ac:16:2d:89:a5:91)" size="6" pos="0" show="ac:16:2d:89:a5:91" value="ac162d89a591">
<field name="eth.dst_resolved" showname="Destination (resolved): HewlettP_89:a5:91" hide="yes" size="6" pos="0" show="HewlettP_89:a5:91" value="ac162d89a591"/>
</field>
<field name="eth.type" showname="Type: IPv4 (0x0800)" size="2" pos="12" show="0x00000800" value="0800"/>
</proto>
<proto name="s1ap" showname="S1 Application Protocol" size="45" pos="78">
<field name="per.choice_index" showname="Choice Index: 0" hide="yes" size="1" pos="78" show="0" value="00"/>
<field name="s1ap.S1AP_PDU" showname="S1AP-PDU: initiatingMessage (0)" size="45" pos="78" show="0" value="000b402900000300000005c007c03ae900080003403b53001a0012113743f99f9500075d010605f070c04070c1">
<field name="s1ap.initiatingMessage_element" showname="initiatingMessage" size="45" pos="78" show="" value="">
<field name="s1ap.procedureCode" showname="procedureCode: id-downlinkNASTransport (11)" size="1" pos="79" show="11" value="0b"/>
</field>
</field>
</proto>
</packet>
<packet>
<proto name="geninfo" pos="0" showname="General information" size="126">
<field name="num" pos="0" show="6" showname="Number" value="6" size="126"/>
</proto>
<proto name="frame" showname="Frame 6: 126 bytes on wire (1008 bits), 126 bytes captured (1008 bits) on interface 0" size="126" pos="0">
<field name="frame.encap_type" showname="Encapsulation type: Ethernet (1)" size="0" pos="0" show="1"/>
<field name="frame.time_epoch" showname="Epoch Time: 1474267260.184197000 seconds" size="0" pos="0" show="1474267259.184197000"/>
</proto>
<proto name="eth" showname="Ethernet II, Src: JuniperN_e6:a6:cc (40:b4:f0:e6:a6:cc), Dst: HewlettP_89:a5:91 (ac:16:2d:89:a5:91)" size="14" pos="0">
<field name="eth.dst" showname="Destination: HewlettP_89:a5:91 (ac:16:2d:89:a5:91)" size="6" pos="0" show="ac:16:2d:89:a5:91" value="ac162d89a591">
<field name="eth.dst_resolved" showname="Destination (resolved): HewlettP_89:a5:91" hide="yes" size="6" pos="0" show="HewlettP_89:a5:91" value="ac162d89a591"/>
</field>
<field name="eth.type" showname="Type: IPv4 (0x0800)" size="2" pos="12" show="0x00000800" value="0800"/>
</proto>
<proto name="s1ap" showname="Some other protocol" size="45" pos="78">
<field name="per.choice_index" showname="Choice Index: 0" hide="yes" size="1" pos="78" show="0" value="00"/>
<field name="other.OTH_PDU" showname="S1AP-PDU: initiatingMessage (0)" size="45" pos="78" show="0" value="000b402900000300000005c007c03ae900080003403b53001a0012113743f99f9500075d010605f070c04070c1">
<field name="other.initiatingMessage_element" showname="initiatingMessage" size="45" pos="78" show="" value="">
<field name="other.procedureCode" showname="procedureCode: id-someTransport (99)" size="1" pos="79" show="11" value="0b"/>
</field>
</field>
</proto>
</packet>
<packet>
<proto name="geninfo" pos="0" showname="General information" size="126">
<field name="num" pos="0" show="6" showname="Number" value="6" size="126"/>
</proto>
<proto name="frame" showname="Frame 6: 126 bytes on wire (1008 bits), 126 bytes captured (1008 bits) on interface 0" size="126" pos="0">
<field name="frame.encap_type" showname="Encapsulation type: Ethernet (1)" size="0" pos="0" show="1"/>
<field name="frame.time_epoch" showname="Epoch Time: 1474267261.184197000 seconds" size="0" pos="0" show="1474267259.184197000"/>
</proto>
<proto name="eth" showname="Ethernet II, Src: JuniperN_e6:a6:cc (40:b4:f0:e6:a6:cc), Dst: HewlettP_89:a5:91 (ac:16:2d:89:a5:91)" size="14" pos="0">
<field name="eth.dst" showname="Destination: HewlettP_89:a5:91 (ac:16:2d:89:a5:91)" size="6" pos="0" show="ac:16:2d:89:a5:91" value="ac162d89a591">
<field name="eth.dst_resolved" showname="Destination (resolved): HewlettP_89:a5:91" hide="yes" size="6" pos="0" show="HewlettP_89:a5:91" value="ac162d89a591"/>
</field>
<field name="eth.type" showname="Type: IPv4 (0x0800)" size="2" pos="12" show="0x00000800" value="0800"/>
</proto>
<proto name="s1ap" showname="S1 Application Protocol" size="45" pos="78">
<field name="per.choice_index" showname="Choice Index: 0" hide="yes" size="1" pos="78" show="0" value="00"/>
<field name="s1ap.S1AP_PDU" showname="S1AP-PDU: initiatingMessage (0)" size="45" pos="78" show="0" value="000b402900000300000005c007c03ae900080003403b53001a0012113743f99f9500075d010605f070c04070c1">
<field name="s1ap.initiatingMessage_element" showname="initiatingMessage" size="45" pos="78" show="" value="">
<field name="s1ap.procedureCode" showname="procedureCode: id-uplinkTransport (13)" size="1" pos="79" show="13" value="0b"/>
</field>
</field>
</proto>
</packet>
<!-- more <packet>s here -->
Upvotes: 3
Views: 1006
Reputation: 1464
For such a huge document you should use the SAX parser:
http://www.rubydoc.info/github/sparklemotion/nokogiri/Nokogiri/XML/SAX
Stream processing a large document is preferable to pulling the whole thing into memory and parsing it as a DOM, especially since the problem requires only a single pass.
Here is code that accomplishes the task by streaming the XML with SAX:
require 'nokogiri'

# SAX handler that prints the procedure code and epoch time of each s1ap packet.
class PacketFilter < Nokogiri::XML::SAX::Document
  def initialize
    reset
  end

  def end_document
    puts 'the document has ended'
  end

  def start_element(name, attributes = [])
    case name
    when 'packet'
      @in_packet = true
    when 'proto'
      # ||= keeps the flag set even if another <proto> follows the s1ap one
      @have_s1ap ||= @in_packet && attribute_value(attributes, 'name') == 's1ap'
    when 'field'
      case attribute_value(attributes, 'name')
      when 's1ap.procedureCode'
        @procedure_code = attribute_value(attributes, 'showname')
      when 'frame.time_epoch'
        @epoch_time = attribute_value(attributes, 'showname')
      end
    end
  end

  def end_element(name)
    if name == 'packet'
      puts "#{@procedure_code}, #{@epoch_time}" if @have_s1ap
      reset
    end
  end

  private

  # attributes arrive as an array of [name, value] pairs
  def attribute_value(attributes, name)
    attributes.reduce(nil) do |value, assoc|
      assoc[0] == name ? assoc[1] : value
    end
  end

  # Clear the per-packet state.
  def reset
    @in_packet = false
    @have_s1ap = false
    @procedure_code = nil
    @epoch_time = nil
  end
end

parser = Nokogiri::XML::SAX::Parser.new(PacketFilter.new)
parser.parse($stdin)
If you paste your data sample into data.xml and the above Ruby into slap.rb:
$ cat data.xml | ruby slap.rb
procedureCode: id-downlinkNASTransport (11), Epoch Time: 1474267259.184197000 seconds
the document has ended
Upvotes: 3
Reputation: 1057
Instead of looping through every node, you could loop through only the packet elements and skip any that don't fit your criteria. This visits only the packet elements instead of all of the elements, which should be significantly faster. Note that it does read and parse the whole file into memory first.
require 'nokogiri'

data = []
file = 'some_file.xml'
doc = Nokogiri::XML.fragment(File.read(file)) # use `read` instead of `open`
doc.xpath('packet').each do |packet|
  procedure_field = packet.at_css("field[name='s1ap.procedureCode']")
  next unless procedure_field # do nothing if the <packet> is not of s1ap type

  epochTime = packet.at_css("field[name='frame.time_epoch']")["show"].to_i
  procedureCode = procedure_field["show"].to_i
  data << { epochTime: epochTime, procedureCode: procedureCode }
end
» data
=> [{:epochTime=>1474267259, :procedureCode=>11}
Upvotes: 0