Reputation: 59
I have an XML file with multiple tags, and I want to count how many times each year appears inside a particular tag, like this:
<Dateline>08/Dec./2009</Dateline>
I simply want to know how many 2009s, 2010s, and so on there are. The day and the month are not important. I want output like this:
2008 >> 10
2009 >> 11
2010 >> 12
2011 >> 15
2012 >> 20
I tried doing it in Perl but had no luck. Also, is it possible to print whatever is between these tags, no matter what date or words it contains, to an external file?
Sample XML:
<Sabanews>
<ID>SBN_ARB_0000001</ID>
<Start URL>sabanews.net/ar/news200024.htm</Start URL>
<Headline>الكونجرس الأمريكي يطالب المجتمع الدولي دعم اليمن لمواجهة التحديات القائمة</Headline>
<Dateline>08/ديسمبر/2009</Dateline>
<Text> واشنطن ـ سبأنت: طالب الكون المزعزعة للاستقرار والعو اليمنيين خصوصا أن يعملوا معا لمجابهة التحديات القائمة". سبأ</Text>
</Sabanews>
Upvotes: 0
Views: 55
Reputation: 53498
This is a simplification, because your specification is vague. Firm it up a bit, and I may clarify or extend this. As it stands, treat it as an example of an approach that could be taken.
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;
my %count_of;
sub extract_date {
    my ( $twig, $dateline ) = @_;
    my $date_string = $dateline->text;
    print $date_string,"\n";
    my ($year) = ( $date_string =~ m#/(\d+)$# );
    $count_of{$year}++;
}
my $parser = XML::Twig->new( twig_roots => { 'Dateline' => \&extract_date } );
#probably want parsefile here in your real world code.
$parser->parse( \*DATA );
foreach my $date ( sort keys %count_of ) {
    print $date, " >> ", $count_of{$date}, "\n";
}
__DATA__
<XML>
<Dateline>01/Dec./2009</Dateline>
<Dateline>02/Dec./2009</Dateline>
<Dateline>03/Dec./2020</Dateline>
<Dateline>04/Dec./2015</Dateline>
<Dateline>05/Dec./2015</Dateline>
</XML>
We set a handler that is triggered each time the parser sees a 'Dateline' element; everything else is ignored.
This handler extracts the text from the element, uses a regular expression to extract the year, and then increments that year's entry in %count_of, which we print afterwards.
Gives:
01/Dec./2009
02/Dec./2009
03/Dec./2020
04/Dec./2015
05/Dec./2015
2009 >> 2
2015 >> 2
2020 >> 1
Edit: Given the new sample XML, you need something slightly different from the above. The approach should still work, though.
Google Translate tells me that ديسمبر is "December", so it's still a date. You may find that Time::Piece parses it correctly, as that should support locales.
Otherwise you need to extract your 'year' with:
my ($year) = ( $date_string =~ m#^\d+/[^/]+/(\d+)$# );
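Since only the trailing year matters, a pattern anchored on the final slash should cope with both the English and the Arabic datelines. The following is a minimal sketch rather than part of the script above; year_from_dateline is a hypothetical helper name, and the four-digit year is an assumption based on the samples shown.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;    # this source file contains literal Arabic text

# Hypothetical helper: return the trailing four-digit year of a
# day/month/year string, or undef if the string does not end in one.
sub year_from_dateline {
    my ($date_string) = @_;
    my ($year) = ( $date_string =~ m#/(\d{4})\s*$# );
    return $year;
}

binmode STDOUT, ':encoding(UTF-8)';

# Try the helper on both dateline styles plus a string with no date.
for my $sample ( '08/Dec./2009', '08/ديسمبر/2009', 'no date here' ) {
    my $year = year_from_dateline($sample);
    print "$sample => ", ( defined $year ? $year : 'no year found' ), "\n";
}
```

Checking for undef before incrementing the counter avoids an "uninitialized value" warning when a Dateline holds something unexpected.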
Edit: To take the filename from the command line:
my ( $filename ) = @ARGV;
$parser -> parsefile ( $filename );
This will let you run xmlparse.pl <filename>.
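For the second part of the question (saving whatever is inside the tags to an external file), the same handler can write each Dateline to an output filehandle while it counts. This is a rough sketch under some assumptions: the second command-line argument and the count_datelines name are illustrative, not from the original script, and the output is written as UTF-8.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper: write every Dateline's text to $outfile and
# return a hash reference of year => number of occurrences.
sub count_datelines {
    my ( $filename, $outfile ) = @_;
    my %count_of;

    # XML::Twig returns decoded text, so encode it as UTF-8 on output.
    open my $out, '>:encoding(UTF-8)', $outfile
        or die "Cannot open $outfile: $!\n";

    my $parser = XML::Twig->new(
        twig_roots => {
            'Dateline' => sub {
                my ( $twig, $dateline ) = @_;
                my $date_string = $dateline->text;
                print {$out} $date_string, "\n";    # save the tag content
                my ($year) = ( $date_string =~ m#/(\d+)\s*$# );
                $count_of{$year}++ if defined $year;
                $twig->purge;    # release elements already handled
            },
        },
    );
    $parser->parsefile($filename);

    close $out or die "Cannot close $outfile: $!\n";
    return \%count_of;
}

# Assumed invocation: xmlparse.pl <input.xml> <output.txt>
if ( @ARGV == 2 ) {
    my $count_of = count_datelines(@ARGV);
    print "$_ >> $count_of->{$_}\n" for sort keys %$count_of;
}
else {
    warn "Usage: $0 <input.xml> <output.txt>\n";
}
```

Note that the sample's <Start URL> element is not well-formed XML (element names cannot contain spaces), so a strict parser such as XML::Twig will likely reject the file until that tag is renamed.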
Upvotes: 1