Ibrahim Hassan

Reputation: 59

counting instances between two tags

I have an XML file with multiple tags, and I want to count how often each year occurs in the dates inside one particular tag, like this:

    <Dateline>08/Dec./2009</Dateline>

I simply want to know how many entries are from 2009, how many from 2010, and so on. The day and the month are not important.

Sample XML:

<Sabanews> 
    <ID>SBN_ARB_0000001</ID> 
    <Start URL>sabanews.net/ar/news200024.htm</Start URL> 
    <Headline>الكونجرس الأمريكي يطالب المجتمع الدولي دعم اليمن لمواجهة التحديات القائمة</Headline> 
    <Dateline>08/ديسمبر/2009</Dateline> 
    <Text> واشنطن ـ سبأنت: طالب الكون المزعزعة للاستقرار والعو اليمنيين خصوصا أن يعملوا معا لمجابهة التحديات القائمة". سبأ</Text> 
</Sabanews>

Upvotes: 0

Views: 55

Answers (1)

Sobrique

Reputation: 53498

This is a simplification, because your specification is vague. Firm it up a bit, and I may clarify or extend the answer. As it stands, treat this as an example of an approach that could be taken.

#!/usr/bin/perl

use strict;
use warnings;

use XML::Twig;
my %count_of;

sub extract_date {
    my ( $twig, $dateline ) = @_;
    my $date_string = $dateline->text;
    print $date_string,"\n";
    my ($year) = ( $date_string =~ m#/(\d+)$# );
    $count_of{$year}++;
}

my $parser = XML::Twig->new( twig_roots => { 'Dateline' => \&extract_date } );
#probably want parsefile here in your real world code.
$parser->parse( \*DATA );


foreach my $date ( sort keys %count_of ) {
    print $date, " >> ", $count_of{$date}, "\n";
}


__DATA__
<XML>
<Dateline>01/Dec./2009</Dateline>
<Dateline>02/Dec./2009</Dateline>
<Dateline>03/Dec./2020</Dateline>
<Dateline>04/Dec./2015</Dateline>
<Dateline>05/Dec./2015</Dateline>
</XML>

We set a handler that is triggered each time the parser sees a 'Dateline' element, and we ignore everything else.

This handler extracts the text from the element, uses a regular expression to extract the year, and then increments that year's entry in %count_of, which we print afterwards.

Gives:

01/Dec./2009
02/Dec./2009
03/Dec./2020
04/Dec./2015
05/Dec./2015
2009 >> 2
2015 >> 2
2020 >> 1

Edit: Given the new sample XML - you need something slightly different to the above. The approach should still work though.

Google Translate tells me that ديسمبر is "December", so it's still a date. You may find that Time::Piece parses it correctly, as that should support locales.
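As a rough illustration of the Time::Piece route, here is a minimal sketch for the English form of the dateline only. The format string '%d/%b./%Y' and the helper name year_via_strptime are assumptions made for this example. Whether strptime recognises the Arabic month name depends on your Time::Piece version and locale setup, and that is not tested here.

```perl
use strict;
use warnings;
use Time::Piece;

# Assumed layout of the English sample: day/abbreviated-month./year.
my $format = '%d/%b./%Y';

# Hypothetical helper: returns the year, or undef if the text does not parse.
sub year_via_strptime {
    my ($date_string) = @_;

    # strptime dies on input that does not match the format, so trap that.
    my $t = eval { Time::Piece->strptime( $date_string, $format ) };
    return defined $t ? $t->year : undef;
}

print year_via_strptime('08/Dec./2009'), "\n";
```

Inside the handler, the returned value could replace the regex capture. An undef result means the dateline did not match the assumed format.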

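Otherwise, extract the year with a pattern that matches the sample as shown, where the month is Arabic text and the year comes last. Capturing the only four-digit run works regardless of field order:

my ($year) = ( $date_string =~ m#(\d{4})# );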
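A four-digit capture can be checked against both dateline forms from this thread using core Perl alone. The sketch below assumes the year is the only four-digit run in the text. year_from_dateline is a hypothetical helper, not part of the script above, and the 'no date here' entry is made-up test input.

```perl
use strict;
use warnings;
use utf8;    # the source below contains a literal Arabic month name

binmode STDERR, ':encoding(UTF-8)';

# Hypothetical helper: assumes the year is the only four-digit run in the text.
sub year_from_dateline {
    my ($date_string) = @_;
    my ($year) = ( $date_string =~ m#(\d{4})# );
    return $year;    # undef when nothing matched
}

my %count_of;
for my $text ( '08/Dec./2009', '08/ديسمبر/2009', 'no date here' ) {
    my $year = year_from_dateline($text);
    if ( defined $year ) {
        $count_of{$year}++;
    }
    else {
        # Skip rather than count under an undefined key.
        warn "Skipping unparsable dateline: $text\n";
    }
}

print "$_ >> $count_of{$_}\n" for sort keys %count_of;
```

Checking that the capture is defined before incrementing avoids an "uninitialized value" warning and a bogus empty-string key when a dateline does not contain a year.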

Edit: To handle 'command line' specification of filename:

my ($filename) = @ARGV;

$parser->parsefile($filename);

This'll let you run xmlparse.pl <filename>.
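If the script is run without an argument, parsefile receives an undefined filename. A small guard can give a clearer message. This is only a sketch: filename_from_args is a hypothetical helper, and the usage text is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: validates the argument list before any parsing starts.
sub filename_from_args {
    my @args = @_;
    die "Usage: $0 <filename>\n" unless @args == 1;
    my ($filename) = @args;
    die "Cannot read '$filename'\n" unless -f $filename && -r $filename;
    return $filename;
}

# In the real script this would be: my $filename = filename_from_args(@ARGV);
# Here the script's own path ($0) stands in as the argument.
my $checked = eval { filename_from_args($0) };
print defined $checked ? "ok: $checked\n" : "error: $@";
```

The returned name can then be passed to $parser->parsefile as above.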

Upvotes: 1
