Reputation: 358
I have a bizarre XML document arranged in the following manner:
<a>
<b>
<c c1="blah" c2="blah">
<d d1="blah0" d2="blah1" d3="blah2" d4="blah3" d5="blah4" />
<d d1="blah5" d2="blah6" d3="blah7" d4="blah8" d5="blah9" />
<d d1="blah10" d2="blah11" d3="blah12" d4="blah13" d5="blah14" />
</c>
<c c1="blahc" c2="blah">
<d d1="blah0" d2="blah1" d3="blah2" d4="blah3" d5="blah4" />
<d d1="blah5" d2="blah6" d3="blah7" d4="blah8" d5="blah9" />
<d d1="blah10" d2="blah11" d3="blah12" d4="blah13" d5="blah14" />
</c>
...
</b>
<b>
....
</b>
<e/>
</a>
I want to extract the values of d2, d4, d5 for all the c nodes within all the b nodes.
I tried using XML::Simple and ran into a lot of difficulties with array referencing.
I tried using XML::DOM, but considering that my XML file is 500MB in size, it does not seem to be a good option. Please suggest a good approach, as I'm new to Perl.
Upvotes: 1
Views: 247
Reputation: 242123
Using xsh:
for a/b/c/d ls (@d2 | @d4 | @d5);
Update (for mirod): Using XML::XSH2 from Perl is less elegant, but it still works:
#!/usr/bin/perl
use strict;
use warnings;
use XML::XSH2;
xsh q{
    open 1.xml ;
    for /a/b/c/d {
        for my $attr in (@d2 | @d4 | @d5) {
            perl { push @ar, $attr }
        }
    }
};
printf "%s:%s\n", $_->name, $_->value for @XML::XSH2::Map::ar;
Or, let Perl write the xsh code for you:
#!/usr/bin/perl
use warnings;
use strict;
use XML::XSH2;
xsh 'open 1.xml';
# Select the d2, d4 and d5 attributes of every d element.
xsh '$attributes = (' . join('|', map '/a/b/c/d/@d' . $_, 2, 4, 5) . ')';
for (@$XML::XSH2::Map::attributes) {
print $_->name, '=', $_->value, "\n";
}
Upvotes: 1
Reputation: 16171
Your question is a bit confusing: you want the attributes of the d element, not of the c element. Or maybe you want the values of the attributes no matter what the element under c is.
In any case, especially if the file is big, this looks like a good match for XML::Twig:
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;
XML::Twig->new( twig_handlers => { 'b/c/*' => \&get_atts })
         ->parse( \*DATA); # replace by parsefile( 'my.xml')

sub get_atts
  { my( $t, $elt)= @_;
    foreach my $att ( qw( d2 d4 d5))
      { print "$att: ", $elt->att( $att), " "; }
    print "\n";
    $t->purge; # this frees the memory so you keep at most 1 d element
  }
__DATA__
<a>
<b>
<c c1="blah" c2="blah">
<d d1="blah0" d2="blah1" d3="blah2" d4="blah3" d5="blah4" />
<d d1="blah5" d2="blah6" d3="blah7" d4="blah8" d5="blah9" />
<d d1="blah10" d2="blah11" d3="blah12" d4="blah13" d5="blah14" />
</c>
<c c1="blahc" c2="blah">
<d d1="blah0" d2="blah1" d3="blah2" d4="blah3" d5="blah4" />
<d d1="blah5" d2="blah6" d3="blah7" d4="blah8" d5="blah9" />
<d d1="blah10" d2="blah11" d3="blah12" d4="blah13" d5="blah14" />
</c>
</b>
<b>
</b>
<e/>
</a>
If the attributes are always in d elements, replace 'b/c/*' with 'b/c/d', which will be more efficient.
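For illustration, here is a minimal sketch of that 'b/c/d' variant. It collects the three attribute values into rows instead of printing them directly. The helper name extract_d_atts and the one-line sample document are made up for this example; for the real 500MB file you would call parsefile( 'my.xml') instead of parse, and probably print each row inside the handler rather than accumulate them.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper (not part of XML::Twig): returns one array ref
# [ d2, d4, d5 ] for every d element found under b/c.
sub extract_d_atts {
    my ($xml) = @_;
    my @rows;
    my $twig = XML::Twig->new(
        twig_handlers => {
            # Fires once per d element, as soon as it has been parsed.
            'b/c/d' => sub {
                my ($t, $elt) = @_;
                push @rows, [ map { $elt->att($_) } qw( d2 d4 d5) ];
                $t->purge;    # release what has been parsed so far
            },
        },
    );
    $twig->parse($xml);       # a string here; use parsefile() for a file
    return @rows;
}

# Made-up sample in the shape of the question's document.
my $sample = '<a><b><c c1="blah" c2="blah">'
           . '<d d1="blah0" d2="blah1" d3="blah2" d4="blah3" d5="blah4" />'
           . '</c></b><e/></a>';

print join( ' ', @$_), "\n" for extract_d_atts($sample);
```

Collecting rows is only convenient for small inputs or testing. With a very large file, doing the output inside the handler keeps memory use flat, which is the point of purge.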
Upvotes: 2
Reputation: 126752
There are many XML modules in CPAN that will help you with this, but in this case my money is on XML::XPath, which allows you to succinctly describe the data you want to extract from the XML.
This program uses your sample data and provides the output I think you want (although, strictly speaking, none of the <c> nodes has any d="xx" attributes).
use strict;
use warnings;
use feature 'say';
use XML::XPath;

my $xml = XML::XPath->new(ioref => \*DATA);

# Visit every d element under b/c, then pick out its d2, d4 and d5 attributes.
for my $dnode ($xml->find('//b/c/d')->get_nodelist) {
    for ($dnode->find('@d2|@d4|@d5')->get_nodelist) {
        say $_->getData;
    }
}
__DATA__
<a>
<b>
<c c1="blah" c2="blah">
<d d1="blah0" d2="blah1" d3="blah2" d4="blah3" d5="blah4" />
<d d1="blah5" d2="blah6" d3="blah7" d4="blah8" d5="blah9" />
<d d1="blah10" d2="blah11" d3="blah12" d4="blah13" d5="blah14" />
</c>
<c c1="blahc" c2="blah">
<d d1="blah0" d2="blah1" d3="blah2" d4="blah3" d5="blah4" />
<d d1="blah5" d2="blah6" d3="blah7" d4="blah8" d5="blah9" />
<d d1="blah10" d2="blah11" d3="blah12" d4="blah13" d5="blah14" />
</c>
...
</b>
<e/>
</a>
Output:
blah1
blah3
blah4
blah6
blah8
blah9
blah11
blah13
blah14
blah1
blah3
blah4
blah6
blah8
blah9
blah11
blah13
blah14
Upvotes: 1