Dan Goodspeed

Reputation: 3560

Basic parsing of XML string with XML::Twig

I've used XML::Simple for over a decade and it has done everything I need, and I barely touch Perl any more. Right now, though, I need to parse an XML string and simply get all of the elements that are children of the root, and for each one get its element type, attributes, and content (I don't care whether there are nested elements; reading the content as a string is perfect). I can do all of that with XML::Simple, except that I also need to keep the order, which XML::Simple can't do when there are multiple element types.

I just installed XML::Twig and it looks overwhelming for something I hoped would be a quick script. It's unlikely that I'll ever use Twig again after this. Is this something that Twig can do easily?

Upvotes: 2

Views: 3852

Answers (3)

Sobrique

Reputation: 53478

At a simple level, traversing the root's children with XML::Twig:

#!/usr/bin/perl

use strict;
use warnings; 

use XML::Twig;

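#use parse ( $xml_string ) instead of parsefile to read from a string. 
my $twig = XML::Twig -> new -> parsefile ( 'myxml.xml' );

foreach my $element ( $twig -> root -> children ) { 
    print $element -> tag, ": ", $element -> text, "\n"; #element type and content. 
}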

Extracting element attributes is done either with att, for a single attribute:

 $element -> att('attributename');

Or you can fetch a hash ref with atts:

 my $attributes = $element -> atts();
 foreach my $key ( keys %$attributes ) {
     print "$key => ", $attributes -> {$key}, "\n";
 }
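Putting those pieces together for the case in the question (an XML string rather than a file, keeping document order), a minimal sketch could look like the following. The sample XML and the record layout (tag, atts, text, xml) are assumptions made up for illustration; the Twig calls used are parse, root, children, tag, atts, text and inner_xml.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use XML::Twig;

# Hypothetical sample input, not taken from the question.
my $xml = <<'XML';
<root>
  <a id="1">first</a>
  <b id="2">second <i>nested</i></b>
  <a id="3">third</a>
</root>
XML

# parse takes the XML as a string; parsefile would take a file name.
my $twig = XML::Twig->new->parse($xml);

# children returns the root's children in document order.
my @records;
foreach my $element ( $twig->root->children ) {
    push @records, {
        tag  => $element->tag,                   # element type
        atts => { %{ $element->atts || {} } },   # copy of the attribute hash
        text => $element->text,                  # text only, nested tags dropped
        xml  => $element->inner_xml,             # content with nested markup kept
    };
}

foreach my $r (@records) {
    my $atts = join ' ', map { "$_=$r->{atts}{$_}" } sort keys %{ $r->{atts} };
    print "$r->{tag} [$atts] $r->{text}\n";
}
```

Use the xml field rather than text if you want nested elements kept as markup in the content string.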

What I particularly like, though, is that for XML with a long list of similar elements to process, you can define a handler. It is called each time the parser finishes one of those elements and is handed that subset of the XML.

sub process_book {
     my ( $twig, $book )  = @_;
     print $book -> first_child_text ('title'), "\n"; #title text, not the element object. 
     $twig -> purge; #discard anything we've already seen. 
}

#handler names are case sensitive, so 'BOOK' matches the <BOOK> elements in the sample below. 
my $twig = XML::Twig -> new ( twig_handlers => { 'BOOK' => \&process_book } ); 
$twig -> parsefile ( 'books.xml' ); 

Sample XML:

<XML>
   <BOOK>
       <title>Elements of style</title>
       <author>Strunk and White</author>
   </BOOK>
</XML>

Upvotes: 4

mirod

Reputation: 16136

The code below should give you enough information to get started.

A few notes:

  • to parse a file use parsefile instead of parse
  • you can also use 'level(1)' instead of '/root/*'
  • using a closure to call the handler (process_elt), passing $atts and $strings, is the clean way to do this; if you want $atts and $strings to be global variables you can just write '/root/*' => \&process_elt, and the handler will be called with the twig and the element as parameters
  • the $t->purge bit is there to free the memory used by the element you just processed; it is useful if the file is too big to fit in memory, otherwise you don't need it
  • DDP is Data::Printer; it's only there to check the output, and you can use any other way to do this (Data::Dumper, YAML, prints...)

Here is the code:

#!/usr/bin/perl

use strict;
use warnings;

use XML::Twig;

my $atts    = []; # attributes
my $strings = []; # text content

XML::Twig->new( twig_handlers => 
                 { '/root/*' => sub { process_elt( @_, $strings, $atts); } })
         ->parse( \*DATA);

use DDP; p $atts; p $strings;

sub process_elt
  { my( $t, $elt, $strings, $atts)= @_;

    push @$atts, $elt->atts;

    my $string= $elt->text;
    if( $elt->tag eq 'e1')
      { $string=~ s{text}{modified}; }
    push @$strings, $string;

    $t->purge;
  }

__DATA__
<root>
  <e1 att_1="val_1_1" att2= "val_2_1">text content of element 1</e1>
  <e1 att_1="val_1_2" att2= "val_2_2">text content of element 2</e1>
  <e2 att_3="val_3_1" att2= "val_2_3">element with <sub_elt>sub element</sub_elt> inside</e2>
</root>
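
For comparison, the variant described in the notes above ('level(1)' instead of '/root/*', and a plain handler reference with file-scoped variables instead of a closure) could be sketched roughly as follows. The inline sample XML and the @tags array are assumptions for illustration, not part of the code above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use XML::Twig;

# File-scoped accumulators: the handler sees them without a closure.
my @tags;
my @strings;

# Hypothetical input; parse reads a string, parsefile would read a file.
my $xml = '<root><e1 a="1">one</e1><e2 a="2">two <s>sub</s></e2></root>';

# 'level(1)' triggers on elements directly under the root (the root is level 0).
XML::Twig->new( twig_handlers => { 'level(1)' => \&process_elt } )
         ->parse($xml);

print "$tags[$_]: $strings[$_]\n" for 0 .. $#tags;

# Called with the twig and the element once each level-1 element is complete.
sub process_elt {
    my ( $t, $elt ) = @_;
    push @tags,    $elt->tag;
    push @strings, $elt->text;
    $t->purge;    # optional: frees memory, only matters for big inputs
}
```

The trade-off is the usual one: less plumbing, but the handler is tied to those two variables.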

Upvotes: 1

choroba

Reputation: 241748

I prefer XML::LibXML. Its Reader doesn't need to keep the whole structure in memory, so it can process large files:

#!/usr/bin/perl
use warnings;
use strict;

use XML::LibXML::Reader;

my $reader = 'XML::LibXML::Reader'->new( location => 'file.xml' );
while ($reader->read) {
    if (1 == $reader->depth
        and XML_READER_TYPE_ELEMENT == $reader->nodeType
       ) {
        my @info = ($reader->name);
        my $inner = $reader->readInnerXml;
        for my $idx (0 .. $reader->attributeCount - 1) {
            $reader->moveToAttributeNo($idx);
            push @info, $reader->name . '=' . $reader->value;
        }
        push @info, $inner;
        print "@info\n";
    }
}
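
Since the question mentions an XML string rather than a file, here is a rough sketch of the same loop reading from a string via the Reader's string option and collecting the rows in order instead of printing them. The sample XML and the row layout are assumptions for illustration only.

```perl
#!/usr/bin/perl
use warnings;
use strict;

use XML::LibXML::Reader;

# Hypothetical input, not taken from the question.
my $xml = '<root><e1 a="1">one</e1><e2 b="2">two <s>sub</s></e2></root>';

my $reader = XML::LibXML::Reader->new( string => $xml );

my @rows;
while ( $reader->read ) {
    # Only start tags of the root's direct children.
    next unless $reader->depth == 1
        and $reader->nodeType == XML_READER_TYPE_ELEMENT;

    # Grab the content before stepping onto the attributes.
    my %row = (
        name  => $reader->name,
        inner => $reader->readInnerXml,
        atts  => [],
    );

    # moveToFirstAttribute / moveToNextAttribute return 1 on success.
    if ( $reader->moveToFirstAttribute == 1 ) {
        do {
            push @{ $row{atts} }, [ $reader->name, $reader->value ];
        } while ( $reader->moveToNextAttribute == 1 );
        $reader->moveToElement;    # step back onto the owning element
    }

    push @rows, \%row;
}

foreach my $row (@rows) {
    my $atts = join ' ', map { "$_->[0]=$_->[1]" } @{ $row->{atts} };
    print "$row->{name} [$atts] $row->{inner}\n";
}
```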

Upvotes: 0
