pratz
pratz

Reputation: 358

Parsing an XML document in Perl

I have a bizarre XML document arranged in the following manner

<a>
   <b>
     <c c1="blah" c2="blah">
        <d d1="blah0" d2="blah1" d3="blah2" d4="blah3" d5="blah4" />
        <d d1="blah5" d2="blah6" d3="blah7" d4="blah8" d5="blah9" />
        <d d1="blah10" d2="blah11" d3="blah12" d4="blah13" d5="blah14" />
     </c>
     <c c1="blahc" c2="blah">
        <d d1="blah0" d2="blah1" d3="blah2" d4="blah3" d5="blah4" />
        <d d1="blah5" d2="blah6" d3="blah7" d4="blah8" d5="blah9" />
        <d d1="blah10" d2="blah11" d3="blah12" d4="blah13" d5="blah14" />
     </c>
    ...
  <b>
    ....
  </b>
  <e/>
</a>

I want to extract the values of d2, d4, d5 for all the c nodes within all the b nodes.

I tried using XML::Simple and ran into a lot of difficulties with array referencing. I tried using XML::DOM, but considering my XML file is 500MB in size, it does not seem to be a good option. Please suggest a good approach as I'm new to Perl

Upvotes: 1

Views: 247

Answers (3)

choroba
choroba

Reputation: 242123

Using xsh:

for a/b/c/d ls (@d2 | @d4 | @d5);

Update: (for mirod): Using XML::XSH2 from Perl is less elegant, but can still work -

#!/usr/bin/perl
use strict;
use warnings;

use XML::XSH2;

xsh q{
    open 1.xml ;
    for /a/b/c/d {
        for my $attr in (@d2 | @d4 | @d5) {
            perl { push @ar, $attr }
        }
    }
};

printf "%s:%s\n", $_->name, $_->value for @XML::XSH2::Map::ar;

Or, let Perl write the xsh code for you:

#!/usr/bin/perl
use warnings;
use strict;

use XML::XSH2;

xsh 'open 1.xml';
xsh '$attributes = (' . join('|', map 'a/b/c/@d' . $_, 1, 2, 4) . ')';
for (@$XML::XSH2::Map::attributes) {
    print $_->name, '=', $_->value, "\n";
}

Upvotes: 1

mirod
mirod

Reputation: 16171

Your question is a bit confusing, you want the attributes for the d element, not for the c element. Or maybe you want the values of the attributes no matter what the element under c is.

In any case, especially if the file is big, this looks like a good match for XML::Twig:

#!/usr/bin/perl

use strict;
use warnings;

use XML::Twig;

XML::Twig->new( twig_handlers => { 'b/c/*' => \&get_atts })
         ->parse( \*DATA); # replace by parsefile( 'my.xml') 

sub get_atts
  { my( $t, $elt)= @_;
    foreach my $att ( qw( d2 d4 d5))
      { print "$att: ", $elt->att( $att), " "; }
    print "\n";
    $t->purge; # this frees the memory so you keep at most 1 d element 
  }

__DATA__
<a>
   <b>
     <c c1="blah" c2="blah">
        <d d1="blah0" d2="blah1" d3="blah2" d4="blah3" d5="blah4" />
        <d d1="blah5" d2="blah6" d3="blah7" d4="blah8" d5="blah9" />
        <d d1="blah10" d2="blah11" d3="blah12" d4="blah13" d5="blah14" />
     </c>
     <c c1="blahc" c2="blah">
        <d d1="blah0" d2="blah1" d3="blah2" d4="blah3" d5="blah4" />
        <d d1="blah5" d2="blah6" d3="blah7" d4="blah8" d5="blah9" />
        <d d1="blah10" d2="blah11" d3="blah12" d4="blah13" d5="blah14" />
     </c>
  </b>
  <b>
  </b>
  <e/>
</a>

If the attributes are always in d elements, replace 'b/c/*' with 'b/c/d', which will be more efficient.

Upvotes: 2

Borodin
Borodin

Reputation: 126752

There are many XML modules in CPAN that will help you with this, but in this case my money is on XML::XPath, which allows you to succinctly describe the data you want to extract from the XML.

This program uses you sample data and provides the output I think you want (although strictly there are no d="xx" attributes for any <c> nodes).

use strict;
use warnings;

use feature 'say';

use XML::XPath;

my $xml = XML::XPath->new(ioref => \*DATA);

for my $cnode ($xml->find('//b/c/d')->get_nodelist) {
  for ($cnode->find('@d2|@d4|@d5')->get_nodelist) {
    print $_->getData, "\n";
  }
}

__DATA__
<a>
   <b>
     <c c1="blah" c2="blah">
        <d d1="blah0" d2="blah1" d3="blah2" d4="blah3" d5="blah4" />
        <d d1="blah5" d2="blah6" d3="blah7" d4="blah8" d5="blah9" />
        <d d1="blah10" d2="blah11" d3="blah12" d4="blah13" d5="blah14" />
     </c>
     <c c1="blahc" c2="blah">
        <d d1="blah0" d2="blah1" d3="blah2" d4="blah3" d5="blah4" />
        <d d1="blah5" d2="blah6" d3="blah7" d4="blah8" d5="blah9" />
        <d d1="blah10" d2="blah11" d3="blah12" d4="blah13" d5="blah14" />
     </c>
    ...
  </b>
  <e/>
</a>

output

blah1
blah3
blah4
blah6
blah8
blah9
blah11
blah13
blah14
blah1
blah3
blah4
blah6
blah8
blah9
blah11
blah13
blah14

Upvotes: 1

Related Questions