rubber boots

Reputation: 15204

Practicable way of reading xml with huge text nodes in Perl

After encountering xml data files containing huge text nodes, I looked for some ways to read and evaluate them in my data processing scripts.

The XML files are 3D coordinate files for molecular modeling applications and have this structure (example):

<?xml version="1.0" encoding="UTF-8"?>
<hoomd_xml version="1.4">
   <configuration>
      <position>
        -0.101000   0.011000  -40.000000
        -0.077000   0.008000  -40.469000
        -0.008000   0.001000  -40.934000
        -0.301000   0.033000  -41.157000
         0.213000  -0.023000  -41.348000
         ...
         ... 300,000 to 500,000 lines may follow  >>
         ...
        -0.140000   0.015000  -42.556000
      </position>

      <next_huge_section_of_the_same_pattern>
        ...
        ...
        ...
      </next_huge_section_of_the_same_pattern>

   </configuration>
</hoomd_xml>

Each XML file contains several huge text nodes and is between 60 MB and 100 MB in size, depending on the contents.

I tried the naive approach using XML::Simple first, but the loader would take forever to initially parse the file:

...
my $data = $xml->XMLin('structure_80mb.xml');
...

and stop with "internal error: huge input lookup", so this approach isn't very practicable.

The next try was to use XML::LibXML for reading - but here, the initial loader would bail out immediately with error message "parser error : xmlSAX2Characters: huge text node".

Before posting this on Stack Overflow, I wrote a quick-and-dirty parser for myself and sent the file through it (after slurping the xx MB XML file into the scalar $xml):

...
# read the <position> data from in-memory xml file
my @Coord = xml_parser_hack('position', $xml);
...

which returns the data of each line as an array, completes within seconds and looks like this:

sub xml_parser_hack {
 my ($tagname, $xml) = @_;
 return () unless $xml =~ /^</;

 my @Data = ();
 my ($p0, $p1) = (undef,undef);
 $p0 = $+[0] if $xml =~ /^<$tagname[^>]*>[^\r\n]*[\r\n]+/msg; # start tag
 $p1 = $-[0] if $xml =~ /^<\/$tagname[^>]*>/msg;             # end tag
 return () unless defined $p0 && defined $p1;
 my @Lines = split /[\r\n]+/, substr $xml, $p0, $p1-$p0;
 for my $line (@Lines) {
    push @Data, [ split /\s+/, $line ];
 }
 return @Data;
}

This works fine so far but cannot be considered 'production ready', of course.

Q: How would I read the file using a Perl module? Which module would I choose?

Thanks in advance

rbo


Addendum: after reading choroba's comment, I looked deeper into XML::LibXML. Opening the file with my $reader = XML::LibXML::Reader->new(location =>'structure_80mb.xml'); works, contrary to what I thought before. The error occurs only when I try to access the text node below the tag:

...
while ($reader->read) {
   # bails out in the loop iteration after accessing the <position> tag,
   # if the position's text node is accessed
   #   --  xmlSAX2Characters: huge text node ---
...

Upvotes: 6

Views: 4917

Answers (2)

nwellnhof

Reputation: 33658

Try XML::LibXML with the huge parser option:

my $doc = XML::LibXML->load_xml(
    location => 'structure_80mb.xml',
    huge     => 1,
);
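
As a rough sketch of what could follow the load, assuming the layout shown in the question: the text of the position node can be split into one array of numbers per line. The inline sample string and the names $sample and @coord are made up for illustration; with a real file, location => 'structure_80mb.xml' would take the place of string => $sample.

```perl
use strict;
use warnings;
use XML::LibXML;

# Tiny stand-in for the real 80 MB file (hypothetical sample data).
my $sample = <<'XML';
<?xml version="1.0" encoding="UTF-8"?>
<hoomd_xml version="1.4">
   <configuration>
      <position>
        -0.101000   0.011000  -40.000000
        -0.077000   0.008000  -40.469000
      </position>
   </configuration>
</hoomd_xml>
XML

# huge => 1 lifts the parser's built-in size limits.
my $doc = XML::LibXML->load_xml(string => $sample, huge => 1);

# The XPath follows the example structure from the question.
my ($node) = $doc->findnodes('/hoomd_xml/configuration/position');
die "no <position> element found\n" unless $node;

# One array ref per non-blank line; split ' ' also drops leading whitespace.
my @coord = map  { [ split ' ', $_ ] }
            grep { /\S/ }
            split /[\r\n]+/, $node->textContent;

printf "%d rows, first z = %s\n", scalar @coord, $coord[0][2];
```

Note that this keeps the whole document and a copy of the text in memory, which should be acceptable for files around 100 MB but is worth keeping in mind for larger ones.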

Or, if you want to use XML::LibXML::Reader:

my $reader = XML::LibXML::Reader->new(
    location => 'structure_80mb.xml',
    huge     => 1,
);
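
With the reader, one possible way to get at the data is to jump to each position element and copy just that subtree. This is only a sketch under the assumption that the element is named position as in the question; $sample and @rows are illustrative names, and string => $sample stands in for location => 'structure_80mb.xml'.

```perl
use strict;
use warnings;
use XML::LibXML::Reader;

# Tiny stand-in for the real file (hypothetical sample data).
my $sample = <<'XML';
<?xml version="1.0" encoding="UTF-8"?>
<hoomd_xml version="1.4">
   <configuration>
      <position>
        -0.101000   0.011000  -40.000000
        -0.077000   0.008000  -40.469000
      </position>
   </configuration>
</hoomd_xml>
XML

my $reader = XML::LibXML::Reader->new(string => $sample, huge => 1)
    or die "cannot create reader\n";

my @rows;
# nextElement skips forward to the next element with the given name.
while ($reader->nextElement('position')) {
    # copyCurrentNode(1) makes a deep copy of the current element only.
    my $text = $reader->copyCurrentNode(1)->textContent;
    push @rows, map  { [ split ' ', $_ ] }
                grep { /\S/ }
                split /[\r\n]+/, $text;
}

printf "%d rows read\n", scalar @rows;
```

Only the copied element is materialized as a tree at any one time, so the other huge sections do not have to be held in memory together.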

Upvotes: 3

Joel

Reputation: 3483

I was able to simulate an answer using XML::LibXML. Try this, and let me know if it doesn't work. I created an XML doc with more than 500k lines in the position element, and I was able to parse it and print the contents of it:

use strict;
use warnings;
use XML::LibXML;

my $xml = XML::LibXML->load_xml(location => '/perl/test.xml');
my $nodes = $xml->findnodes('/hoomd_xml/configuration/position');
print $nodes->[0]->textContent . "\n";
print scalar(@{$nodes}) . "\n";

I'm using findnodes to use an XPath expression to pull out all the nodes that I want. $nodes is just an array ref, so you can loop through it depending on how many nodes you actually have in your document.

Upvotes: 1
