dusker
dusker

Reputation: 580

Parsing XML file with perl - regex

i'm just a begginer in perl, and very urgently need to prepare a small script that takes top 3 things from an xml file and puts them in a new one. Here's an example of an xml file:

    <article>
  {lot of other stuff here}
</article>
<article>
  {lot of other stuff here}
</article>
<article>
  {lot of other stuff here}
</article>
<article>
  {lot of other stuff here}
</article>

What i'd like to do is to get first 3 items along with all the tags in between and put it into another file. Thanks for all the help in advance regards peter

Upvotes: 1

Views: 1621

Answers (2)

Tomalak
Tomalak

Reputation: 338188

Never ever use Regex to handle markup languages.

The original version of this answer (see below) used XML::XPath. Grant McLean said in the comments:

XML::XPath is an old and unmaintained module. XML::LibXML is a modern, maintained module with an almost identical API and it's faster too.

so I made a new version that uses XML::LibXML (thanks, Grant):

use warnings;
use strict;
use XML::LibXML;

my $doc   = XML::LibXML->load_xml(location => 'articles.xml');
my $xp    = XML::LibXML::XPathContext->new($doc->documentElement);
my $xpath = '/articles/article[position() < 4]';

foreach my $article ( $xp->findnodes($xpath) ) {
  # now do something with $article
  print $article.": ".$article->getName."\n";
}

For me this prints:

XML::LibXML::Element=SCALAR(0x346ef90): article
XML::LibXML::Element=SCALAR(0x346ef30): article
XML::LibXML::Element=SCALAR(0x346efa8): article

Links to the relevant documentation:


Original version of the answer, based on the XML::XPath package:

use warnings;
use strict;
use XML::XPath;

my $xp    = XML::XPath->new(filename => 'articles.xml');
my $xpath = '/articles/article[position() < 4]';

foreach my $article ( $xp->findnodes($xpath)->get_nodelist ) {
  # now do something with $article
  print $article.": ".$article->getName ."\n";
}

which prints this for me:

XML::XPath::Node::Element=REF(0x38067b8): article
XML::XPath::Node::Element=REF(0x38097e8): article
XML::XPath::Node::Element=REF(0x3809ae8): article

Have a look at the docs to find out what you can do with them.

Upvotes: 12

Snake Plissken
Snake Plissken

Reputation: 678

Here:

 open my $input, "<", "file.xml" or die $!;
 open my $output, ">", "truncated-file.xml" or die $!;
 my $n_articles = 0;
 while (<$input>) {
      print $output $_;
      if (m:</article>:) {
           $n_articles++;
           if ($n_articles >= 3) {
                last;
           }
      }
 }         
 close $input or die $!;
 close $output or die $!;

You really don't need an XML parser to do such a simple job.

Upvotes: 0

Related Questions