I wrote code to import an XML file into a database and ran into a huge performance problem. The XML file is 150 MB and contains 170,000 ART elements with corresponding subelements (REF_DATA, ...). I was able to find the cause of the problem, but I don't know how to solve it.
Each ART element has subelements (see the sample XML below). The problem arises when an ART contains several ARTPRI subelements, which are distinguished by their PTYP subelement. I would like to extract ARTPRI/VDAT and ARTPRI/PRICE for each of them and store the values in variables $v_dat_pexf, $v_dat_ppub, $v_dat_zurr, etc.
Below is a minimal example of my code. It needs 30 seconds to read one ART element. When I remove the part between START node2 and END node2, the XML file is processed very fast (< 1 s per ART).
Does anyone have an idea why this part of the code slows down the process, and how to deal with it? Thanks for any help.
And this is the code:
my $xml_article = "oddb_article.xml";
my $xpc = XML::LibXML::XPathContext->new();
$xpc->registerNs(sr => 'http://whatever');
my $doc = XML::LibXML->load_xml(location => $xml_article);
my @node1_art = $xpc->findnodes("/sr:ARTICLE/sr:ART", $doc);
my $i = 0;
foreach my $node1 ( @node1_art ) {
    $i++;
    my $ref_data = $xpc->findvalue('./sr:REF_DATA',$node1);
    my @node1_art_artpri = $xpc->findnodes("/sr:ARTICLE/sr:ART/sr:ARTPRI", $doc);
    my $v_dat_pexf;
    # -- search through each ARTPRI within ART
    # (This is the part which slows down processing)
    # -------- START node2 -----------------------
    foreach my $node2 ( @node1_art_artpri ) {
        my $ctrl1 = $xpc->findvalue('./sr:PTYP',$node2);
        if ( $ctrl1 eq 'PEXF' ) {
            $v_dat_pexf = $xpc->findvalue('./sr:VDAT',$node2);
        }
    }
    # -------- END node2 -----------------------
    print "Row $i\n";
}
Here is a copy-paste version of the XML with three ART elements:
<?xml version="1.0" encoding="utf-8"?>
<ARTICLE xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://whatever" CREATION_DATETIME="2015-11-17T05:44:14+0100" PROD_DATE="2015-11-17T05:44:14+0100" VALID_DATE="2015-11-17T05:44:14+0100">
<ART DT="" SHA256="2744a856e9bdf226e68bd555f0695b37f6477c55fca3d9eec36a0740fe8146c2">
<REF_DATA>1</REF_DATA>
<PHAR>0000000</PHAR>
<SALECD>I</SALECD>
<CDBG>N</CDBG>
<BG>N</BG>
<DSCRD>Epimineral Paste</DSCRD>
<DSCRF>Epimineral pâte</DSCRF>
<SORTD>EPIMINERAL PASTE</SORTD>
<SORTF>EPIMINERAL PâTE</SORTF>
<ARTCOMP>
<COMPNO>7601003300741</COMPNO>
</ARTCOMP>
<ARTBAR>
<CDTYP>E13</CDTYP>
<BC>0</BC>
<BCSTAT>A</BCSTAT>
</ARTBAR>
<ARTPRI>
<VDAT>01.10.2015</VDAT>
<PTYP>PEXF</PTYP>
<PRICE>305.83</PRICE>
</ARTPRI>
<ARTPRI>
<VDAT>01.10.2015</VDAT>
<PTYP>PPUB</PTYP>
<PRICE>367.5</PRICE>
</ARTPRI>
<ARTINS>
<NINCD>10</NINCD>
</ARTINS>
</ART>
<ART DT="" SHA256="ac0eb1ad7c81f5476541ead533c48690a2c9cf3b1dd0ba8ae295145b6bcb1b40">
<REF_DATA>0</REF_DATA>
<PHAR>0021976</PHAR>
<SALECD>I</SALECD>
<CDBG>N</CDBG>
<BG>N</BG>
<DSCRD>DIOPARINE Gtt Opht 7500 E 5 ml</DSCRD>
<DSCRF>DIOPARINE Gtt Opht 7500 E 5 ml</DSCRF>
<SORTD>DIOPARINE GTT OPHT 7500 E 5 ML</SORTD>
<SORTF>DIOPARINE GTT OPHT 7500 E 5 ML</SORTF>
<ARTCOMP/>
<ARTBAR>
<CDTYP>E13</CDTYP>
<BC>0</BC>
<BCSTAT>A</BCSTAT>
</ARTBAR>
<ARTPRI>
<VDAT>01.10.2015</VDAT>
<PTYP>PEXF</PTYP>
<PRICE>305.83</PRICE>
</ARTPRI>
<ARTPRI>
<VDAT>01.10.2015</VDAT>
<PTYP>PPUB</PTYP>
<PRICE>367.5</PRICE>
</ARTPRI>
</ART>
<ART DT="" SHA256="ecc62600e79183822abddb3af0d2a1f9dfb9f2c343c51a2cf135a45354ba7de1">
<REF_DATA>0</REF_DATA>
<PHAR>0027447</PHAR>
<SALECD>I</SALECD>
<CDBG>N</CDBG>
<BG>N</BG>
<DSCRD>ARTHROSENEX Salbe 100 g</DSCRD>
<DSCRF>ARTHROSENEX Salbe 100 g</DSCRF>
<SORTD>ARTHROSENEX SALBE 100 G</SORTD>
<SORTF>ARTHROSENEX SALBE 100 G</SORTF>
<ARTCOMP/>
<ARTBAR>
<CDTYP>E13</CDTYP>
<BC>0</BC>
<BCSTAT>A</BCSTAT>
</ARTBAR>
<ARTPRI>
<VDAT>01.10.2015</VDAT>
<PTYP>PEXF</PTYP>
<PRICE>305.83</PRICE>
</ARTPRI>
<ARTPRI>
<VDAT>01.10.2015</VDAT>
<PTYP>PPUB</PTYP>
<PRICE>367.5</PRICE>
</ARTPRI>
</ART>
<RESULT>
<OK_ERROR>OK</OK_ERROR>
<NBR_RECORD>170673</NBR_RECORD>
<ERROR_CODE/>
<MESSAGE/>
</RESULT>
</ARTICLE>
I think the root of your problem will be the combination of these two lines:
my @node1_art = $xpc->findnodes("/sr:ARTICLE/sr:ART", $doc);
my @node1_art_artpri = $xpc->findnodes("/sr:ARTICLE/sr:ART/sr:ARTPRI", $doc);
It looks like you're finding every ART node across the whole document (scanning the whole thing), and then, for every such node, scanning the whole document again to find every ARTPRI node.
Which you're then iterating in:
foreach my $node2 ( @node1_art_artpri ) {
... but only capturing the value of the last one you find.
That looks like it might be a logic error (hard to tell for sure).
Have you tried printing something each time that loop iterates, to see how many times it runs?
The intent looks like a small number of iterations: your example ART has only two ARTPRI nodes beneath it, but the loop will run considerably more often than that. If every ART has about two ARTPRI children, that is roughly 340,000 iterations per ART in the full file.
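One way to check this is to count how many nodes each query returns. The sketch below is only an illustration: it uses a small inline document modelled on the question's sample (an assumption, not the real file) and compares the absolute path with a path relative to the current ART.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use XML::LibXML;

# Stand-in document (assumption): three ART elements, two ARTPRI each,
# in the same default namespace as the question's sample.
my $art = '<ART><REF_DATA>0</REF_DATA>'
    . '<ARTPRI><VDAT>01.10.2015</VDAT><PTYP>PEXF</PTYP><PRICE>305.83</PRICE></ARTPRI>'
    . '<ARTPRI><VDAT>01.10.2015</VDAT><PTYP>PPUB</PTYP><PRICE>367.5</PRICE></ARTPRI>'
    . '</ART>';
my $xml = '<ARTICLE xmlns="http://whatever">' . ( $art x 3 ) . '</ARTICLE>';

my $xpc = XML::LibXML::XPathContext->new();
$xpc->registerNs( sr => 'http://whatever' );
my $doc = XML::LibXML->load_xml( string => $xml );

my ( $absolute_count, $relative_count ) = ( 0, 0 );
foreach my $node1 ( $xpc->findnodes( '/sr:ARTICLE/sr:ART', $doc ) ) {
    # Absolute path: every ARTPRI in the document, on every pass.
    my @all = $xpc->findnodes( '/sr:ARTICLE/sr:ART/sr:ARTPRI', $doc );
    # Relative path: only the ARTPRI children of the current ART.
    my @own = $xpc->findnodes( './sr:ARTPRI', $node1 );
    $absolute_count += @all;
    $relative_count += @own;
}
print "absolute: $absolute_count, relative: $relative_count\n";
```

For three ART elements with two ARTPRI each, the absolute query yields 18 inner iterations in total and the relative one yields 6. The gap grows quadratically with the number of ART elements.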
This should be fixable by:
my @node1_art_artpri = $xpc->findnodes("./sr:ARTPRI", $node1);
(or something similar - the important part is to set the context to the node, not the doc).
Or perhaps by making use of an XPath predicate instead of your logical condition:
#!/usr/bin/env perl
use strict;
use warnings;
use XML::LibXML;

my $xpc = XML::LibXML::XPathContext->new();
$xpc->registerNs( sr => 'http://whatever' );
my $doc = XML::LibXML->load_xml( location => 'test4.xml' );

foreach my $node1 ( $xpc->findnodes( "/sr:ARTICLE/sr:ART", $doc ) ) {
    my $ref_data = $xpc->findvalue( './sr:REF_DATA', $node1 );
    my $v_dat_pexf =
        $xpc->findvalue( './sr:ARTPRI/sr:PTYP[text()="PEXF"]/../sr:VDAT', $node1 );
    print "$ref_data => $v_dat_pexf\n";
}
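Since the question mentions several target variables ($v_dat_pexf, $v_dat_ppub, ...), one possible variation is to collect every price type of an ART into a hash keyed by PTYP in a single pass. This is a rough sketch, not the asker's actual import code: the inline document and the names %price and @rows are assumptions made for illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use XML::LibXML;

# Stand-in document (assumption), modelled on the question's sample.
my $xml = <<'XML';
<ARTICLE xmlns="http://whatever">
  <ART>
    <REF_DATA>1</REF_DATA>
    <PHAR>0000000</PHAR>
    <ARTPRI><VDAT>01.10.2015</VDAT><PTYP>PEXF</PTYP><PRICE>305.83</PRICE></ARTPRI>
    <ARTPRI><VDAT>01.10.2015</VDAT><PTYP>PPUB</PTYP><PRICE>367.5</PRICE></ARTPRI>
  </ART>
</ARTICLE>
XML

my $xpc = XML::LibXML::XPathContext->new();
$xpc->registerNs( sr => 'http://whatever' );
my $doc = XML::LibXML->load_xml( string => $xml );

my @rows;    # hypothetical container; a real import would insert into the database here
foreach my $node1 ( $xpc->findnodes( '/sr:ARTICLE/sr:ART', $doc ) ) {
    my %price;    # PTYP => { vdat => ..., price => ... }
    # Context is the current ART, so only its own ARTPRI children are visited.
    foreach my $node2 ( $xpc->findnodes( './sr:ARTPRI', $node1 ) ) {
        my $ptyp = $xpc->findvalue( './sr:PTYP', $node2 );
        $price{$ptyp} = {
            vdat  => $xpc->findvalue( './sr:VDAT',  $node2 ),
            price => $xpc->findvalue( './sr:PRICE', $node2 ),
        };
    }
    push @rows, { phar => $xpc->findvalue( './sr:PHAR', $node1 ), prices => \%price };
}

foreach my $row (@rows) {
    foreach my $ptyp ( sort keys %{ $row->{prices} } ) {
        my $p = $row->{prices}{$ptyp};
        print "$row->{phar} $ptyp $p->{vdat} $p->{price}\n";
    }
}
```

A lookup such as $row->{prices}{PEXF}{vdat} then plays the role of $v_dat_pexf, and a price type that is absent for a given ART simply has no key in the hash.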
But actually, for this sort of task I might be thinking in terms of XML::Twig, which lets you do things via twig_handlers and thus keep the memory footprint down.
#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;

sub extract_pexf {
    my ( $twig, $ART ) = @_;
    # or stuff it in an array, whatever.
    print $ART->first_child_text('REF_DATA'), " => ";
    print $ART->get_xpath( './/ARTPRI/PTYP[string()="PEXF"]/../VDAT', 0 )->text, "\n";
    $twig->purge;    # clear processed data from memory.
}

XML::Twig->new( twig_handlers => { 'ART' => \&extract_pexf } )->parsefile('your_xml');
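If some ART elements might have no PEXF entry, a handler that walks the ARTPRI children directly avoids calling a method on an undefined result. The following is a minimal sketch under that assumption: the inline document and the names collect_prices and @rows are made up for illustration, and the string is parsed with parse rather than parsefile so the example is self-contained.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;

# Stand-in document (assumption): the second ART has no PEXF price.
my $xml = <<'XML';
<ARTICLE xmlns="http://whatever">
  <ART>
    <REF_DATA>1</REF_DATA>
    <ARTPRI><VDAT>01.10.2015</VDAT><PTYP>PEXF</PTYP><PRICE>305.83</PRICE></ARTPRI>
    <ARTPRI><VDAT>01.10.2015</VDAT><PTYP>PPUB</PTYP><PRICE>367.5</PRICE></ARTPRI>
  </ART>
  <ART>
    <REF_DATA>0</REF_DATA>
    <ARTPRI><VDAT>01.10.2015</VDAT><PTYP>PPUB</PTYP><PRICE>367.5</PRICE></ARTPRI>
  </ART>
</ARTICLE>
XML

my @rows;    # hypothetical container for the extracted values

sub collect_prices {
    my ( $twig, $art ) = @_;
    my %vdat;    # PTYP => VDAT for this ART only
    foreach my $artpri ( $art->children('ARTPRI') ) {
        $vdat{ $artpri->first_child_text('PTYP') } = $artpri->first_child_text('VDAT');
    }
    push @rows, { ref_data => $art->first_child_text('REF_DATA'), vdat => \%vdat };
    $twig->purge;    # the values are already copied, so the parsed ART can be freed
}

XML::Twig->new( twig_handlers => { ART => \&collect_prices } )->parse($xml);

foreach my $row (@rows) {
    my $pexf = $row->{vdat}{PEXF} // '(no PEXF)';
    print "$row->{ref_data} => $pexf\n";
}
```

Because the handler fires once per ART as the element closes, only one ART needs to be held in memory at a time, which matters more for a 150 MB file than for this toy input.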