Saad Ahmed

Reputation: 59

XML::LibXML for Perl: parsing huge XML files through XPath causes a segmentation fault (core dump)

I have a huge XML file in the following format:

<XML>
<Application id="1" attr1="some value" attr2="some val"..and many more attr also with nested tags inside application which might contain more attributes
</Application>

<Application id="2" attr1="some value" attr2="some val"..and many more attr also with nested tags inside application which might contain more attributes
</Application>

<Application id="3" attr1="some value" attr2="some val"..and many more attr also with nested tags inside application which might contain more attributes
</Application>

 .... probably 10000 more Application entries
</XML>

Each Application tag has only attributes and no text content, but it also contains nested tags that can have attributes of their own, and I need to parse the file and extract some of those attributes. The script I am using works fine on a small subset of Application tags, but it becomes extremely slow as the number of records grows, and unfortunately it fails with "Segmentation fault (core dumped)" when I run it on the full file, or even on half of it.

Here is my script. Any suggestions on how to do this better would be really appreciated.

Upvotes: 3

Views: 639

Answers (3)

pengchy

Reputation: 820

Here is my test. Input XML file: test2.xml

<?xml version="1.0" encoding="UTF-8"?>
<metabolite>
  <version>3.6</version>
  <creation_date>2005-11-16 15:48:42 UTC</creation_date>
  <update_date>2014-06-11 23:17:42 UTC</update_date>
  <accession>HMDB00001</accession>
  <secondary_accessions>
    <accession>HMDB04935</accession>
    <accession>HMDB06703</accession>
    <accession>HMDB06704</accession>
  </secondary_accessions>
  <name>1-Methylhistidine</name>
</metabolite>

Here is my Perl script: parse_hmdb_metabolites_xml.pl

#!/usr/bin/perl -w 

use strict;
use Getopt::Long;
use XML::Simple;

my $usage= "\n$0 
--xml     \t<str>\thmdb xml file
--outf    \t<str>\toutput file
\n";

my($xml,$outf);

GetOptions(
                "xml:s"=>\$xml,
                "outf:s"=>\$outf
);

die $usage if !defined $xml;

print "$xml\n";
my $cust_xml = XMLin($xml);

Here is the test output:

perl parse_hmdb_metabolites_xml.pl  --xml test2.xml
test2.xml
Segmentation fault (core dumped)

I will test XML::LibXML next.

Upvotes: 0

mirod

Reputation: 16171

I am sure you can get XML::LibXML::Reader to do this, but I am not familiar with it. So here is how you would do it with XML::Twig.

The handler below just gives examples of how to get at the data inside each Application element.

#!/usr/bin/perl

use strict;
use warnings;

use XML::Twig;

my $filename1 = "exam.xml";

# process_application is called once for each complete Application element
my $parser = XML::Twig->new( twig_handlers => { Application => \&process_application })
                        ->parsefile($filename1);

sub process_application
  { my( $t, $sample)= @_;
    my $hncid    = $sample->att('ID');                    # get an attribute
    my @persons  = $sample->children( 'Person');
    my @aplnamt  = map { $_->att( 'APLN') } @persons;     # that's how you get all attribute values
    my @students = $sample->findnodes( './Person/Student');
    my @nsschl   = map { $_->att('NS') } @students;
    my @d81      = $sample->descendants('*[@D8CHRG]');    # native navigation
    my @d81_alt  = $sample->findnodes('.//*[@D8CHRG]');   # same elements; you can use a subset of XPath

    $t->purge;                                           # this is where you free the memory
  }

Now that I think of it, you can actually use XML::Twig::XPath to get the full power of XPath; I am just more used to XML::Twig's native navigation methods.
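
As a rough illustration of the XML::Twig::XPath route, here is a minimal sketch. It assumes XML::XPathEngine (or XML::XPath) is installed, reuses the attribute names from the handler above (ID, APLN, NS), and runs on a made-up inline sample; the helper name apln_by_student_ns is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig::XPath;    # assumption: XML::XPathEngine (or XML::XPath) is installed

# Hypothetical helper: for each Application, list the APLN values of the
# Person children that have a Student child whose NS attribute equals $ns.
sub apln_by_student_ns {
    my ( $xml, $ns ) = @_;
    my %apln_for;    # Application ID => [ APLN values ]

    my $twig = XML::Twig::XPath->new(
        twig_handlers => {
            Application => sub {
                my ( $t, $app ) = @_;
                # nested path inside a predicate, evaluated by the XPath engine
                my @persons = $app->findnodes(qq{./Person[Student/\@NS="$ns"]});
                $apln_for{ $app->att('ID') } = [ map { $_->att('APLN') } @persons ];
                $t->purge;    # free everything parsed so far
            },
        },
    );
    $twig->parse($xml);    # use parsefile($filename) for a file on disk
    return \%apln_for;
}

# Made-up sample data, only for demonstration
my $sample = <<'XML';
<XML>
  <Application ID="1"><Person APLN="10"><Student NS="A"/></Person><Person APLN="11"/></Application>
  <Application ID="2"><Person APLN="20"><Student NS="B"/></Person></Application>
</XML>
XML

my $found = apln_by_student_ns( $sample, 'A' );
print "$_: @{ $found->{$_} }\n" for sort keys %$found;
```

The handler and the purge call are the same as with plain XML::Twig, so memory use should stay bounded by the size of one Application element; only the expressions passed to findnodes change.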

Upvotes: 2

KeepCalmAndCarryOn

Reputation: 9075

I think your problem is that XML::LibXML is a tree-based parser, so the whole of your document is read into memory. You could investigate a stream-based parser and build your own structures holding only what you need.
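
One stream-based option that stays within libxml2 is XML::LibXML::Reader. The following is only a sketch under assumptions: the id attribute comes from the question's sample, while the Person element and APLN attribute are borrowed from the other answer, and collect_applications is a hypothetical helper name. It copies one Application subtree at a time, so XPath can still be used on that small fragment.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML::Reader;

# Hypothetical helper: walk the document as a stream and keep only the
# values we need. %source is passed straight to the Reader constructor,
# e.g. ( location => 'huge.xml' ) or ( string => $xml ).
sub collect_applications {
    my (%source) = @_;
    my $reader = XML::LibXML::Reader->new(%source)
        or die "cannot create XML reader\n";

    my @rows;
    # nextElement skips forward to the next element with this name
    while ( $reader->nextElement('Application') ) {
        # deep copy of the current element only, as a small DOM fragment
        my $app  = $reader->copyCurrentNode(1);
        my @apln = map { $_->getAttribute('APLN') } $app->findnodes('./Person');
        push @rows, { id => $app->getAttribute('id'), apln => \@apln };
    }
    return \@rows;
}

# Made-up sample data, only for demonstration
my $xml = <<'XML';
<XML>
  <Application id="1"><Person APLN="100"/></Application>
  <Application id="2"><Person APLN="200"/><Person APLN="300"/></Application>
</XML>
XML

my $rows = collect_applications( string => $xml );
print "$_->{id}: @{ $_->{apln} }\n" for @$rows;
```

Only the copied Application fragment and the collected values are held in Perl at any time, which is the point of the stream-based approach; for the real file, pass location => $filename instead of string => $xml.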

Upvotes: 1
