user2987595

Reputation: 9

How to generate sequential ids for an XML element using a Perl script

Input:

<h2>Chapter One</h2>    
<h2>Chapter Two</h2>    
<h2>Chapter Three</h2>    
<h2>Chapter Four</h2>

Output: what I need

<h2 id="1">Chapter One</h2>
<h2 id="2">Chapter Two</h2>
<h2 id="3">Chapter Three</h2>
<h2 id="4">Chapter Four</h2>

Kindly help on this.. thanks

Upvotes: 1

Views: 189

Answers (2)

Shaun McDonald

Reputation: 91

I think the one-liner answer is great if all your input XML is as consistent as your example (very simple, containing only elements), or if you only have a handful of files to validate afterwards. In general, though, processing XML as text is a bad idea. By its nature it isn't plain text; it's highly structured. For instance, if the encoding matters or varies between files, you'll definitely want to parse it as XML.

I've become partial to XML::Twig because it can stream (it can also build a full XML tree), which is a parse style much closer to the command-line edit you've already seen here. I deal with a great deal of data. XML::Twig is actually very easy to use, although the initial learning curve for setup and configuration may take a bit of research.

Some people prefer XML::LibXML (a little simpler to set up), which offers a more DOM-style flavor but is more expensive on large data sets and a bit more unwieldy with very large files. From there, modules get less complex, for example XML::Simple.

Again, this greatly depends on your requirements, data size, validation standards etc. The one-liner is quick, but not quite best practice for handling XML.

Possible Solution

Assumptions -

  • Your XML is well-formed; that is, it has a root element.
  • Your chapters could run to more entries than you're willing to type by hand.
  • You won't have chapter values with some form of decimal/fraction (One.One, or One and a Half etc.)

You could use XML::Twig and Lingua::EN::Words2Nums

So, given input:

<root>
   <h2>Chapter One</h2>
   <h2>Chapter Two</h2>
   <h2>Chapter Three</h2>
   <h2>Chapter Four</h2>
</root>

This code:

use strict;
use warnings;
use XML::Twig;
use Lingua::EN::Words2Nums;

# Handle only <h2> elements; everything else is printed through unchanged.
my $twig = XML::Twig->new(
      twig_roots => { 'h2' => \&h2_handler },
      twig_print_outside_roots => 1);

sub h2_handler {
   my ($twig, $elt) = @_;
   my $engNum = $elt->trimmed_text;
   # Strip the leading "Chapter " so only the number word remains.
   $engNum =~ s/^chapter\s([a-z]+)$/$1/i;
   my $num = words2nums($engNum);
   if (defined($num) and $num =~ /^\d+$/) {
      $elt->set_att( id => $num );
   } else {
      # Whatever you do if some chapter number is not what's expected
   }
   $elt->flush;
}

# Replace the placeholder with the path to your XML file.
$twig->parsefile('pathToYourFile');

Will output:

<root>
   <h2 id="1">Chapter One</h2>
   <h2 id="2">Chapter Two</h2>
   <h2 id="3">Chapter Three</h2>
   <h2 id="4">Chapter Four</h2>
</root>

Upvotes: 1

mpapec

Reputation: 50647

Quick regex (it takes the id from a chapter number written in digits, e.g. Chapter 1):

perl -pe '($n)=/Chapter\s+([0-9]+)/; s|<h2\K| id="$n"|' file

You can also check the question "What's the best XML parser for Perl?"

Upvotes: 1
