Reputation: 9
Input:
<h2>Chapter One</h2>
<h2>Chapter Two</h2>
<h2>Chapter Three</h2>
<h2>Chapter Four</h2>
Desired output:
<h2 id="1">Chapter One</h2>
<h2 id="2">Chapter Two</h2>
<h2 id="3">Chapter Three</h2>
<h2 id="4">Chapter Four</h2>
Kindly help with this. Thanks.
Upvotes: 1
Views: 189
Reputation: 91
I think the regex one-liner answer is great if all your input XML is consistent with your example, i.e. very simple and containing only elements, or if you only have a handful of files to validate afterwards. In general, processing XML as text is a bad thing. By its nature, it isn't text; it's highly structured. For instance, if the encoding matters or varies between files, you'll definitely want to parse it as XML.
I've become partial to XML::Twig because of the option to stream (one can also build an XML tree), which is a parse style much closer to the command-line edit you have already seen here. I deal with a great deal of data. XML::Twig is actually very easy to use, but the initial learning curve for implementation and configuration may take a bit of research effort.
Some people prefer XML::LibXML (a little simpler to set up), which offers a more DOM-style flavor, but it is more expensive when applied to large data sets and a bit more unwieldy with very large files. From there, other modules get a little less complex, such as XML::Simple.
Again, this greatly depends on your requirements, data size, validation standards etc. The one-liner is quick, but not quite best practice for handling XML.
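For comparison, here is a minimal sketch of the DOM-style approach using XML::LibXML. It assumes the input is wrapped in a single root element (the `<root>` wrapper is my addition, not part of the question) and that the chapters appear in order, so the ids come from a simple counter rather than from the heading text. The helper name `number_headings` is made up for illustration.

```perl
use strict;
use warnings;
use XML::LibXML;

# Sample input from the question, wrapped in an assumed <root> element
my $xml = <<'XML';
<root>
<h2>Chapter One</h2>
<h2>Chapter Two</h2>
<h2>Chapter Three</h2>
<h2>Chapter Four</h2>
</root>
XML

# Hypothetical helper: load the whole document, then number every <h2>
# in document order
sub number_headings {
    my ($xml_string) = @_;
    my $doc = XML::LibXML->load_xml(string => $xml_string);
    my $n = 0;
    for my $h2 ($doc->findnodes('//h2')) {
        $h2->setAttribute(id => ++$n);
    }
    return $doc->toString;
}

print number_headings($xml);
```

Because the whole document is loaded into memory before anything is changed, this is convenient for small files but is exactly the cost mentioned above for very large ones.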
Possible Solution
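Assumptions: XML::Twig and Lingua::EN::Words2Nums are installed (both are available on CPAN), and the input is well-formed XML with a single root element.
You could use XML::Twig and Lingua::EN::Words2Nums.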
So, given input:
<root>
<h2>Chapter One</h2>
<h2>Chapter Two</h2>
<h2>Chapter Three</h2>
<h2>Chapter Four</h2>
</root>
This code:
use strict;
use warnings;
use XML::Twig;
use Lingua::EN::Words2Nums;

# Handle only <h2> elements; everything else is printed through unchanged
my $twig = XML::Twig->new(
    twig_roots               => { 'h2' => \&h2_handler },
    twig_print_outside_roots => 1,
);

sub h2_handler {
    my ($twig, $elt) = @_;
    # Reduce "Chapter One" to "One"
    my $engNum = $elt->trimmed_text;
    $engNum =~ s/^chapter\s+([a-z]+)$/$1/i;
    my $num = words2nums($engNum);
    if (defined $num and $num =~ /^\d+$/) {
        $elt->set_att(id => $num);
    } else {
        # Whatever you do if some chapter number is not what's expected
    }
    $elt->flush;    # print the element and release its memory
}

$twig->parsefile('pathToYourFile');    # replace with the path to your XML file
Will output:
<root>
<h2 id="1">Chapter One</h2>
<h2 id="2">Chapter Two</h2>
<h2 id="3">Chapter Three</h2>
<h2 id="4">Chapter Four</h2>
</root>
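If you would rather not install Lingua::EN::Words2Nums, a plain lookup table can stand in for words2nums when the chapter names are few and known in advance. This is only a sketch: the `%WORD_TO_NUM` table and the `chapter_number` helper are hypothetical names of mine, and the table only covers one through ten.

```perl
use strict;
use warnings;

# Hypothetical lookup table; extend it to cover the chapter names in your data
my %WORD_TO_NUM = (
    one => 1, two   => 2, three => 3, four => 4, five => 5,
    six => 6, seven => 7, eight => 8, nine => 9, ten  => 10,
);

# Returns the chapter number for headings like "Chapter Three",
# or undef when the heading does not match or the word is unknown
sub chapter_number {
    my ($heading) = @_;
    my ($word) = $heading =~ /^\s*chapter\s+([a-z]+)\s*$/i
        or return;
    return $WORD_TO_NUM{ lc $word };
}

for my $heading ('Chapter One', 'Chapter Four', 'Preface') {
    my $num = chapter_number($heading);
    printf "%s => %s\n", $heading, defined $num ? $num : '(no id)';
}
```

Inside h2_handler you would call chapter_number on the trimmed text and set the id only when the result is defined, which also gives you a natural place to handle unexpected headings.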
Upvotes: 1
Reputation: 50647
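Quick regex. The headings contain words, not digits, so this numbers them with a counter and assumes the chapters appear in order:
perl -pe 's|<h2\K(?=>)|q( id=").++$n.q(")|e' file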
Also see the question "What's the best XML parser for Perl?"
Upvotes: 1