user1897691

Reputation: 2461

How can I speed up XML::Twig

I am using XML::Twig to parse through a very large XML document. I want to split it into chunks based on the <change></change> tags.

Right now I have:

my $xml = XML::Twig->new(twig_handlers => { 'change' => \&parseChange, });
$xml->parsefile($LOGFILE);

sub parseChange {

  my ($xml, $change) = @_;

  my $message = $change->first_child('message');
  my @lines   = $message->children_text('line');

  foreach (@lines) {
    if ($_ =~ /[^a-zA-Z0-9](?i)bug(?-i)[^a-zA-Z0-9]/) {
      print outputData "$_\n";
    }
  }

  outputData->flush();
  $change->purge;
}

Right now this runs the parseChange method each time it pulls one of those blocks from the XML, and it is going extremely slow. I tested it against reading the file with $/ = "</change>" and writing a function to return the contents of an XML tag, and that went much faster.

Is there something I'm missing or am I using XML::Twig incorrectly? I'm new to Perl.

EDIT: Here is an example change from the changes file. The file consists of a lot of these one right after the other and there should not be anything in between them:

<change>
  <project>device_common</project>
  <commit_hash>523e077fb8fe899680c33539155d935e0624e40a</commit_hash>
  <tree_hash>598e7a1bd070f33b1f1f8c926047edde055094cf</tree_hash>
  <parent_hashes>71b1f9be815b72f925e66e866cb7afe9c5cd3239</parent_hashes>
  <author_name>Jean-Baptiste Queru</author_name>
  <author_e-mail>[email protected]</author_e-mail>
  <author_date>Fri Apr 22 08:32:04 2011 -0700</author_date>
  <commiter_name>Jean-Baptiste Queru</commiter_name>
  <commiter_email>[email protected]</commiter_email>
  <committer_date>Fri Apr 22 08:32:04 2011 -0700</committer_date>
  <subject>chmod the output scripts</subject>
  <message>
    <line>Change-Id: Iae22c67066ba4160071aa2b30a5a1052b00a9d7f</line>
  </message>
  <target>
    <line>generate-blob-scripts.sh</line>
  </target>
</change>
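Incidentally, the $/ = "</change>" comparison mentioned above can be sketched as follows. This is a rough sketch for comparison only: it reads from an inlined sample record rather than the real log file, and a regex is far more fragile than a real XML parser, though cheap for this fixed format.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Sample record standing in for the real log file.
my $sample = <<'XML';
<change>
<message>
    <line>Fix a bug in the output scripts</line>
    <line>Change-Id: Iae22c67066ba4160071aa2b30a5a1052b00a9d7f</line>
</message>
<target>
    <line>generate-blob-scripts.sh</line>
</target>
</change>
XML

our @hits;
open my $in, '<', \$sample or die "open: $!";
{
    local $/ = '</change>';    # read one <change> record at a time
    while (my $chunk = <$in>) {
        # look only inside this record's <message> block
        next unless $chunk =~ m{<message>(.*?)</message>}s;
        my $message = $1;
        while ($message =~ m{<line>([^<]*)</line>}g) {
            push @hits, $1 if $1 =~ /[^a-z0-9]bug[^a-z0-9]/i;
        }
    }
}
close $in;
print "$_\n" for @hits;
```

In practice you would open the real log file instead of the in-memory string.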

Upvotes: 5

Views: 1244

Answers (5)

Dave Hodgkinson

Reputation: 375

Mine's taking a horrifically long time.

my $twig = XML::Twig->new(
    twig_handlers => {
        SchoolInfo => \&schoolinfo,
    },
    pretty_print => 'indented',
);

$twig->parsefile('data/SchoolInfos.2018-04-17.xml');

sub schoolinfo {
    my ($twig, $l) = @_;
    my $rec = {
        name  => $l->field('SchoolName'),
        refid => $l->{att}{RefId},
        phone => $l->field('SchoolPhoneNumber'),
    };

    for my $node ($l->findnodes('//Street'))    { $rec->{street}   = $node->text; }
    for my $node ($l->findnodes('//Town'))      { $rec->{city}     = $node->text; }
    for my $node ($l->findnodes('//PostCode'))  { $rec->{postcode} = $node->text; }
    for my $node ($l->findnodes('//Latitude'))  { $rec->{lat}      = $node->text; }
    for my $node ($l->findnodes('//Longitude')) { $rec->{lng}      = $node->text; }
}

Is it the pretty_print perchance? Otherwise it's pretty straightforward.
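One thing the snippet above never does is free each SchoolInfo once it has been handled, so the tree keeps growing as the file is parsed. If the records are only being extracted, purging inside the handler (per the XML::Twig docs) keeps memory flat. A minimal sketch, with the address loops elided:

```perl
use strict;
use warnings;

sub schoolinfo {
    my ($twig, $l) = @_;
    my $rec = {
        name  => $l->field('SchoolName'),
        refid => $l->{att}{RefId},
        phone => $l->field('SchoolPhoneNumber'),
    };
    # ... collect the address fields exactly as above ...
    $l->purge;    # discard this SchoolInfo's subtree before the next one
}
```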

Upvotes: 0

Jurgen Pletinckx

Reputation: 108

Not an XML::Twig answer, but ...

If you're going to extract stuff from XML files, you might want to consider XSLT. Using xsltproc and the XSL stylesheet below, I got the bug-containing change lines out of 1 GB of <change>s in about a minute. Lots of improvements are possible, I'm sure.

<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0" >

  <xsl:output method="text"/>
  <xsl:variable name="lowercase" select="'abcdefghijklmnopqrstuvwxyz'" />
  <xsl:variable name="uppercase" select="'ABCDEFGHIJKLMNOPQRSTUVWXYZ'" />

  <xsl:template match="/">
    <xsl:apply-templates select="changes/change/message/line"/>
  </xsl:template>

  <xsl:template match="line">
    <xsl:variable name="lower" select="translate(.,$uppercase,$lowercase)" />
    <xsl:if test="contains($lower,'bug')">
      <xsl:value-of select="."/>
      <xsl:text>
</xsl:text>
    </xsl:if>
  </xsl:template>
</xsl:stylesheet>

If your XML processing can be done as

  1. extract to plain text
  2. wrangle flattened text
  3. profit

then XSLT may be the tool for the first step in that process.
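Assuming the stylesheet above is saved as bug-lines.xsl and the log as changes.xml (both names are placeholders), the invocation is:

```shell
# Run the stylesheet over the change log and collect matching lines.
xsltproc bug-lines.xsl changes.xml > bug-lines.txt
```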

Upvotes: 0

creaktive

Reputation: 5220

If your XML is really big, use XML::SAX. It doesn't have to load the entire document into memory; instead, it reads the file sequentially and fires a callback event for every tag. I have successfully used XML::SAX to parse XML files larger than 1 GB. Here is an example of an XML::SAX handler for your data:

#!/usr/bin/env perl
package Change::Extractor;
use 5.010;
use strict;
use warnings qw(all);

use base qw(XML::SAX::Base);

sub new {
    bless { data => '', path => [] }, shift;
}

sub start_element {
    my ($self, $el) = @_;
    $self->{data} = '';
    push @{$self->{path}} => $el->{Name};
}

sub end_element {
    my ($self, $el) = @_;
    # compare the element path explicitly (the ~~ smartmatch operator
    # is deprecated in recent perls)
    if (join('/', @{$self->{path}}) eq 'change/message/line') {
        say $self->{data};
    }
    pop @{$self->{path}};
}

sub characters {
    my ($self, $data) = @_;
    $self->{data} .= $data->{Data};
}

1;

package main;
use strict;
use warnings qw(all);

use XML::SAX::PurePerl;

my $handler = Change::Extractor->new;
my $parser = XML::SAX::PurePerl->new(Handler => $handler);

$parser->parse_file(\*DATA);

__DATA__
<?xml version="1.0"?>
<change>
  <project>device_common</project>
  <commit_hash>523e077fb8fe899680c33539155d935e0624e40a</commit_hash>
  <tree_hash>598e7a1bd070f33b1f1f8c926047edde055094cf</tree_hash>
  <parent_hashes>71b1f9be815b72f925e66e866cb7afe9c5cd3239</parent_hashes>
  <author_name>Jean-Baptiste Queru</author_name>
  <author_e-mail>[email protected]</author_e-mail>
  <author_date>Fri Apr 22 08:32:04 2011 -0700</author_date>
  <commiter_name>Jean-Baptiste Queru</commiter_name>
  <commiter_email>[email protected]</commiter_email>
  <committer_date>Fri Apr 22 08:32:04 2011 -0700</committer_date>
  <subject>chmod the output scripts</subject>
  <message>
    <line>Change-Id: Iae22c67066ba4160071aa2b30a5a1052b00a9d7f</line>
  </message>
  <target>
    <line>generate-blob-scripts.sh</line>
  </target>
</change>

Outputs

Change-Id: Iae22c67066ba4160071aa2b30a5a1052b00a9d7f

Upvotes: 0

Borodin

Reputation: 126742

As it stands, your program is processing all of the XML document, including the data outside the change elements that you aren't interested in.

If you change the twig_handlers parameter in your constructor to twig_roots, then the tree structures will be built for only the elements of interest and the rest will be ignored.

my $xml = XML::Twig->new(twig_roots => { change => \&parseChange });

Upvotes: 3

dan1111

Reputation: 6566

XML::Twig includes a mechanism by which you can handle tags as they appear, then discard what you no longer need to free memory.

Here is an example taken from the documentation (which also has a lot more helpful information):

my $t = XML::Twig->new(
    twig_handlers => {
        section => \&section,
        para    => sub { $_->set_tag('p'); },
    },
);
$t->parsefile('doc.xml');

# the handler is called once a section is completely parsed, i.e. when
# the end tag for section is found; it receives the twig itself and
# the element (including all its sub-elements) as arguments
sub section {
    my ($t, $section) = @_;       # arguments for all twig_handlers
    $section->set_tag('div');     # change the tag name
    # let's use the attribute nb as a prefix to the title
    my $title = $section->first_child('title');  # find the title
    my $nb    = $title->att('nb');               # get the attribute
    $title->prefix("$nb - ");     # easy, isn't it?
    $section->flush;              # outputs the section and frees memory
}

This will probably be essential when working with a multi-gigabyte file, because (again, according to the documentation) storing the entire thing in memory can take as much as 10 times the size of the file.

Edit: A couple of comments based on your edited question. It is not clear exactly what is slowing you down without knowing more about your file structure, but here are a few things to try:

  • Flushing the output filehandle will slow you down if you are writing a lot of lines. Perl caches file writing specifically for performance reasons, and you are bypassing that.
  • Instead of using the (?i) mechanism, a rather advanced feature that probably has a performance penalty, why not make the whole match case insensitive? /[^a-z0-9]bug[^a-z0-9]/i is equivalent. You also might be able to simplify it with /\bbug\b/i, which is nearly equivalent, the only difference being that underscores are included in the non-matching class.
  • There are a couple of other simplifications that can be made as well to remove intermediate steps.
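The underscore difference between the two patterns can be seen with a few made-up samples:

```perl
use strict;
use warnings;

# Compare the question's character-class pattern (made wholly
# case-insensitive) with \b word boundaries. They agree except around
# underscores, which \b treats as word characters.
our %results;
for my $s (' Bug ', ' debug ', '_bug_') {
    $results{$s} = {
        class    => ($s =~ /[^a-z0-9]bug[^a-z0-9]/i) ? 1 : 0,
        boundary => ($s =~ /\bbug\b/i)               ? 1 : 0,
    };
    printf "'%s': class=%d boundary=%d\n",
        $s, $results{$s}{class}, $results{$s}{boundary};
}
```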

How does this handler code compare to yours speed-wise?

sub parseChange
{
    my ($xml, $change) = @_;

    foreach(grep /[^a-z0-9]bug[^a-z0-9]/i, $change->first_child_text('message'))
    {
        print outputData "$_\n";
    }

    $change->purge;
}

Upvotes: 1
