Reputation: 59
I have a huge XML file in the following format:
<XML>
<Application id="1" attr1="some value" attr2="some val"..and many more attr also with nested tags inside application which might contain more attributes
</Application>
<Application id="2" attr1="some value" attr2="some val"..and many more attr also with nested tags inside application which might contain more attributes
</Application>
<Application id="3" attr1="some value" attr2="some val"..and many more attr also with nested tags inside application which might contain more attributes
</Application>
.... probably 10000 more Application entries
</XML>
Each Application tag has only attributes and no text content, but it also contains nested tags that can have attributes of their own. I need to parse the file and extract some of those attributes. I am using the following script. It works fine on a small subset of Application tags, but it gets extremely slow as the number of records grows, and unfortunately it crashes with "Segmentation fault (core dumped)" when I run it on the full file, or even on half of it.
Here is my script. Any suggestions on how to do this better would be really appreciated.
Upvotes: 3
Views: 639
Reputation: 820
Here is a test case. Input XML file: test2.xml
<?xml version="1.0" encoding="UTF-8"?>
<metabolite>
<version>3.6</version>
<creation_date>2005-11-16 15:48:42 UTC</creation_date>
<update_date>2014-06-11 23:17:42 UTC</update_date>
<accession>HMDB00001</accession>
<secondary_accessions>
<accession>HMDB04935</accession>
<accession>HMDB06703</accession>
<accession>HMDB06704</accession>
</secondary_accessions>
<name>1-Methylhistidine</name>
</metabolite>
Here is my Perl script: parse_hmdb_metabolites_xml.pl
#!/usr/bin/perl -w
use strict;
use Getopt::Long;
use XML::Simple;
my $usage= "\n$0
--xml \t<str>\thmdb xml file
--outf \t<str>\toutput file
\n";
my($xml,$outf);
GetOptions(
"xml:s"=>\$xml,
"outf:s"=>\$outf
);
die $usage if !defined $xml;
print "$xml\n";
my $cust_xml = XMLin($xml);
Here is the test output:
perl parse_hmdb_metabolites_xml.pl --xml test2.xml
test2.xml
Segmentation fault (core dumped)
I will test XML::LibXML next.
Upvotes: 0
Reputation: 16171
I am sure you can get XML::LibXML::Reader to do this, but I am not familiar with it, so here is how you would do it with XML::Twig.
The handler below only gives examples of how to get at the data inside each Application element.
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;
my $filename1 = "exam.xml"; # must be declared with my under "use strict"
my $parser = XML::Twig->new( twig_handlers => { Application => \&process_application })
->parsefile($filename1);
sub process_application
{ my( $t, $sample)= @_;
my $hncid = $sample->att('ID'); # get an attribute
my @persons = $sample->children( 'Person');
my @aplnamt = map { $_->att( 'APLN') } @persons; # that's how you get all attribute values
my @students = $sample->findnodes( './Person/Student');
my @nsschl = map { $_->att('NS') } @students;
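my @d81 = $sample->descendants('*[@D8CHRG]'); # native navigation: descendants that have a D8CHRG attribute
my @d82 = $sample->findnodes('.//*[@D8CHRG]'); # the equivalent query with findnodes; you can use a subset of XPath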
$t->purge; # this is where you free the memory
}
Now that I think of it, you can actually use XML::Twig::XPath to get the full power of XPath; I am just more used to XML::Twig's native navigation methods.
Upvotes: 2
Reputation: 9075
I think your problem is that XML::LibXML, used in its default DOM mode, is a tree-based parser, so the whole document is read into memory. You could investigate a stream-based parser and build your own structures containing only what you need.
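As a rough illustration of the stream-based approach, here is a minimal sketch using XML::LibXML::Reader, the pull parser that ships with XML::LibXML. It visits one Application element at a time, copies only that element's subtree, and keeps just the extracted values. The inline sample document, the extract_applications helper, and the Person/APLN names (borrowed from the XML::Twig answer) are assumptions for illustration; the real element and attribute names depend on your file.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML::Reader;

# Hypothetical sample data. For a real file, build the reader with
# XML::LibXML::Reader->new( location => $path ) instead.
my $xml = <<'XML';
<XML>
  <Application id="1" attr1="a"><Person APLN="x"/></Application>
  <Application id="2" attr1="b"><Person APLN="y"/><Person APLN="z"/></Application>
</XML>
XML

# Walk the stream and return one hash ref per Application element.
sub extract_applications {
    my ($reader) = @_;
    my @rows;

    # nextElement returns 1 when it finds another element with this name,
    # 0 at the end of the document, and -1 on error.
    while ( $reader->nextElement('Application') == 1 ) {

        # Deep-copy only the current Application subtree into a DOM node.
        my $app = $reader->copyCurrentNode(1);

        # Collect an attribute from each direct Person child (assumed names).
        my @apln = map { $_->getAttribute('APLN') }
                   $app->getChildrenByTagName('Person');

        push @rows, { id => $app->getAttribute('id'), apln => \@apln };
    }
    return @rows;
}

my $reader = XML::LibXML::Reader->new( string => $xml )
    or die "cannot create reader\n";

for my $row ( extract_applications($reader) ) {
    print "$row->{id}: @{ $row->{apln} }\n";
}
```

Only the small per-record hashes accumulate here. If even those are too much for the full file, you could print or store each record inside the loop instead of collecting them all in @rows.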
Upvotes: 1