Krishh

Reputation: 33

Parsing particular sections of a file in Perl

I am new to Perl and still exploring it.

I have an .xml file and want to extract a few sections from it. Each section starts and ends with <field>, and I want to get the content in between them:

 <field>
    <address>20</address>
    <startat>0</startat>
    <size>8</size>
 <field>

 <field>
    <address>21</address>
    <startat>0</startat>
    <size>8</size>
<field>

The output I am looking for is:

    <address>20</address>
    <startat>0</startat>
    <size>8</size>

    <address>21</address>
    <startat>0</startat>
    <size>8</size>  

How would I go about extracting that part of the file?

Any help is much appreciated.

Upvotes: 0

Views: 71

Answers (1)

Javier Elices

Reputation: 2154

You could solve this by scanning the text yourself, but it is always safer to use an XML parser. There are a number of excellent Perl XML libraries on CPAN. One that I like is XML::LibXML, an interface to libxml2, which offers a lot of functionality. Using its XPath support (the findnodes method, also available through XML::LibXML::XPathContext), we could do:

#!/usr/bin/perl

use strict;
use warnings;

use XML::LibXML;

my $parser = XML::LibXML->new( recover => 1 );
# parse_string dies on a fatal parse error, so wrap it in eval to make $@ checkable
my $xp = eval { $parser->parse_string(<<'EndXML') };
  <document>
    <field>
      <address>20</address>
      <startat>0</startat>
      <size>8</size>
    </field>

    <field>
      <address>21</address>
      <startat>0</startat>
      <size>8</size>
    </field>
  </document>
EndXML

if( $@ ) {
  die "Cannot parse XML: $@\n";
}

foreach my $c ( $xp->findnodes('//field') ) {
  print $c->findnodes('.'), "\n";
}

The output:

<field>
      <address>20</address>
      <startat>0</startat>
      <size>8</size>
    </field>
<field>
      <address>21</address>
      <startat>0</startat>
      <size>8</size>
    </field>

A few comments:

  1. The option recover => 1 may be useful for parsing broken XML files. It will not fix every problem, but it may be able to fix some of them. Omit the option if you want no fixing. Use recover => 2 to make the fixing silent.
  2. This code uses findnodes, which takes an XPath expression. Here //field matches every <field> element anywhere in the document. Calling findnodes('.') on each match returns that element itself, so print serializes the whole element, including the <field> tags.
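
The question asks for only the content between the <field> tags, so here is a minimal sketch of one way to get that, assuming well-formed input. The helper name field_contents is made up for this illustration; load_xml, findnodes and toString are XML::LibXML calls. The XPath './*' selects the element children of each <field> instead of the <field> itself.

```perl
#!/usr/bin/perl

use strict;
use warnings;

use XML::LibXML;

# Hypothetical helper: returns one string per <field>,
# holding only that field's child elements, one per line.
sub field_contents {
  my ($xml) = @_;
  my $doc = XML::LibXML->load_xml( string => $xml );

  my @blocks;
  foreach my $field ( $doc->findnodes('//field') ) {
    # './*' matches the element children of the current <field>
    push @blocks, join "\n", map { $_->toString } $field->findnodes('./*');
  }
  return @blocks;
}

my $xml = <<'EndXML';
<document>
  <field>
    <address>20</address>
    <startat>0</startat>
    <size>8</size>
  </field>
  <field>
    <address>21</address>
    <startat>0</startat>
    <size>8</size>
  </field>
</document>
EndXML

# Print each field's content followed by a blank line
print "$_\n\n" for field_contents($xml);
```

Each child is serialized on its own, so the original indentation is not kept. To read from a file instead of a string, load_xml also accepts location => $filename.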

Upvotes: 2
