Gouri

Reputation: 11

Extracting from repetitive multi level tags containing repetitive tags using Perl

I have an XML File (edited).

    <xml>
        <PubmedData>
            <History>
                <PubMedPubDate PubStatus="entrez">
                    <Year>2010</Year>
                    <Month>6</Month>
                    <Day>18</Day>
                    <Hour>6</Hour>
                    <Minute>0</Minute>
                </PubMedPubDate>
                <PubMedPubDate PubStatus="pubmed">
                    <Year>2010</Year>
                    <Month>7</Month>
                    <Day>19</Day>
                    <Hour>6</Hour>
                    <Minute>10</Minute>
                </PubMedPubDate>
                <PubMedPubDate PubStatus="medline">
                    <Year>2010</Year>
                    <Month>8</Month>
                    <Day>20</Day>
                    <Hour>7</Hour>
                    <Minute>0</Minute>
                </PubMedPubDate>
            <PublicationStatus>aheadofprint</PublicationStatus>
            <Initials>JJ</Initials>
            <NlmUniqueID>8434563</NlmUniqueID>
            </History>  
            <History>
                <PubMedPubDate PubStatus="entrez">
                    <Year>2011</Year>
                    <Month>4</Month>
                    <Day>18</Day>
                    <Hour>10</Hour>
                    <Minute>20</Minute>
                </PubMedPubDate>
                <PubMedPubDate PubStatus="pubmed">
                    <Year>2011</Year>
                    <Month>7</Month>
                    <Day>24</Day>
                    <Hour>8</Hour>
                    <Minute>10</Minute>
                </PubMedPubDate>
                <PubMedPubDate PubStatus="medline">
                    <Year>2011</Year>
                    <Month>3</Month>
                    <Day>4</Day>
                    <Hour>5</Hour>
                    <Minute>37</Minute>
                </PubMedPubDate>
            <PublicationStatus>aheadofprint</PublicationStatus>
            <Initials>BP</Initials>
            <NlmUniqueID>9814863</NlmUniqueID>
            </History>
        </PubmedData>
    </xml>

I want to extract everything under the History tag and get the list of the different years, months, days, hours and minutes. I was able to parse a simple XML file with XML::Simple and get the output, but I cannot extract information from repeated, multi-level tags that themselves contain repeated tags. Please help me figure it out.

Thanks, Gouri

Upvotes: 0

Views: 762

Answers (3)

Dimanoid

Reputation: 7289

You can use XML::TreeBuilder, something like this:

use XML::TreeBuilder;

# $xml must hold the whole document as a string
my $root = XML::TreeBuilder->new();
$root->parse($xml);

# Collect every <PubMedPubDate> element, at any depth
my @history = $root->look_down(_tag => 'PubMedPubDate');
foreach my $h (@history) {
    printf "%s: %d-%d-%d %d:%d\n", $h->attr('PubStatus'),
        $h->look_down(_tag => 'Year')->as_text,
        $h->look_down(_tag => 'Month')->as_text,
        $h->look_down(_tag => 'Day')->as_text,
        $h->look_down(_tag => 'Hour')->as_text,
        $h->look_down(_tag => 'Minute')->as_text;
}

You will get the following output:

entrez: 2010-6-18 6:0
pubmed: 2010-7-19 6:10
medline: 2010-8-20 7:0
entrez: 2011-4-18 10:20
pubmed: 2011-7-24 8:10
medline: 2011-3-4 5:37

Note: the document needs exactly one root tag, so if yours has none, wrap it in <xml></xml>, for example.

Upvotes: 1

killzone

Reputation: 91

This can also be done with XML::Simple. The ForceArray option makes XMLin return the listed elements as array references even when they occur only once, so the same nested loops work however often each tag repeats:

use strict;
use warnings;
use XML::Simple;

# Parse File.xml; the root element itself is discarded by default
my $ref = XMLin('File.xml', ForceArray => [qw(PubmedData History PubMedPubDate)]);

for my $data (@{ $ref->{PubmedData} }) {
    for my $history (@{ $data->{History} }) {
        for my $date (@{ $history->{PubMedPubDate} }) {
            print "-" x 70 . "\n";
            print "Year   : $date->{Year}\n";
            print "Month  : $date->{Month}\n";
            print "Day    : $date->{Day}\n";
        }
    }
}

Output:

----------------------------------------------------------------------
Year   : 2010
Month  : 6
Day    : 18
----------------------------------------------------------------------
Year   : 2010
Month  : 7
Day    : 19
----------------------------------------------------------------------
Year   : 2010
Month  : 8
Day    : 20
----------------------------------------------------------------------
Year   : 2011
Month  : 4
Day    : 18
----------------------------------------------------------------------
Year   : 2011
Month  : 7
Day    : 24
----------------------------------------------------------------------
Year   : 2011
Month  : 3
Day    : 4
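
When the nesting is unclear, it can help to print the structure that XMLin returns before writing any loops. The following is only a sketch: the inline sample is a shortened, made-up version of the question's data, and the ForceArray list is an assumption about which tags may repeat.

```perl
use strict;
use warnings;
use XML::Simple;
use Data::Dumper;

# Hypothetical, shortened sample; the real file has more fields.
my $xml = '<xml><PubmedData><History>'
        . '<PubMedPubDate PubStatus="entrez"><Year>2010</Year></PubMedPubDate>'
        . '</History></PubmedData></xml>';

# Force the tags that may repeat into array references.
my $ref = XMLin($xml, ForceArray => [qw(History PubMedPubDate)]);

# Sorted keys and compact indentation make the dump easier to read.
$Data::Dumper::Sortkeys = 1;
$Data::Dumper::Indent   = 1;
print Dumper($ref);
```

The dump shows which levels are hashes and which are arrays, and that tells you whether each step of the access path needs ->{Name} or ->[$index].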

Upvotes: 0

Cornel Ghiban

Reputation: 902

The following code works when you have one <PubmedData> tag:

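use strict;
use warnings;

use XML::Simple ();

# Slurp the XML document from the __DATA__ section
my $xml = do { local $/; <DATA> };

my $x = XML::Simple->new;
# ForceArray: always get array refs, whether a tag occurs once or several times
my $doc = $x->XMLin($xml, ForceArray => [qw(History PubMedPubDate)]);

for my $history (@{ $doc->{History} }) {
    for my $date (@{ $history->{PubMedPubDate} }) {
        printf "%d-%02d-%02d\n", $date->{Year}, $date->{Month}, $date->{Day};
    }
}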

__DATA__
<PubmedData>
...
</PubmedData>

If there are several <PubmedData> tags, you'll have to enclose them all in another container element.
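
A rough sketch of that container approach, under some assumptions: the <Container> name is arbitrary, the sample data is made up and shortened, and ForceArray is used so that every level is an array reference no matter how many times a tag occurs.

```perl
use strict;
use warnings;
use XML::Simple ();

# Hypothetical input: two <PubmedData> blocks with no common root element.
my $fragments = <<'XML';
<PubmedData>
  <History>
    <PubMedPubDate PubStatus="entrez"><Year>2010</Year><Month>6</Month><Day>18</Day></PubMedPubDate>
  </History>
</PubmedData>
<PubmedData>
  <History>
    <PubMedPubDate PubStatus="entrez"><Year>2011</Year><Month>4</Month><Day>18</Day></PubMedPubDate>
    <PubMedPubDate PubStatus="pubmed"><Year>2011</Year><Month>7</Month><Day>24</Day></PubMedPubDate>
  </History>
</PubmedData>
XML

# Wrap the fragments in one root element; XMLin discards the root by default.
my $doc = XML::Simple->new->XMLin(
    "<Container>$fragments</Container>",
    ForceArray => [qw(PubmedData History PubMedPubDate)],
);

# Walk every level as an array and collect the formatted dates.
my @dates;
for my $data (@{ $doc->{PubmedData} }) {
    for my $history (@{ $data->{History} }) {
        for my $date (@{ $history->{PubMedPubDate} }) {
            push @dates, sprintf "%d-%02d-%02d", @{$date}{qw(Year Month Day)};
        }
    }
}
print "$_\n" for @dates;
```

Without ForceArray, a tag that occurs only once comes back as a plain hash reference, and the matching loop would then fail with a "Not an ARRAY reference" error.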

Upvotes: 0
