dreamer

Reputation: 478

Parsing HTML with a Perl script

I am trying to parse an HTML file with a Perl script, using the HTML::TreeBuilder module.

Here is what I have so far:

use strict;
use warnings;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;
$tree->parse_file("sample.html");

# Print the text of every <p> element
foreach my $para ($tree->find("p")) {
  print $para->as_text, "\n";
}

This works: it prints the text inside every <p> tag.

sample.html file:

<td>Release Version:</td><td> 5134</td></tr>
<tr class="d0"><td>Executed By:</td><td>spoddar</td></tr>
<tr class="d1"><td> Duration:</td><td>0 Hrs 0 Mins 0 Secs </td></tr>
<tr class="d0"><td>#TCs Executed:</td><td>1</td></tr>

I want 5134 to be printed when I pass "Release Version:", and likewise spoddar when I pass "Executed By:". These labels are cell text, not HTML tags, so I cannot search for them with find. Is there any way to get the value that sits next to a given label?

Upvotes: 1

Views: 4305

Answers (2)

marquezc329

Reputation: 71

HTML::Parser and HTML::TokeParser may also be of use to you.


UNTESTED

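use strict;
use warnings;
use HTML::TokeParser;

my $p = HTML::TokeParser->new('sample.html')
    or die "Cannot open sample.html: $!";

while (my $token = $p->get_token) {
    my $tokenType = shift @{$token}; # 'S' is a start tag, 'E' an end tag, etc. (see the docs)
    next unless $tokenType eq 'S';
    my ($tag, $attr, $attrseq, $rawtxt) = @{$token};
    my $class = $attr->{class} // ''; # the tag's class; empty string if it has none
    if ($tag eq 'tr' && $class eq 'd0') {
        # A method call is not interpolated inside a string, so pass it to print directly
        print $p->get_trimmed_text('/tr'), "\n";
    }
}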
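
The loop above prints the whole text of every d0 row. To look a value up by its label instead, one option is to read the cells pairwise with get_tag and collect them into a hash. This is only a sketch under assumptions: the HTML is inlined as a string (passed by reference, which HTML::TokeParser treats as document content), the first row is given the opening <tr> that the sample omits, every row is assumed to have exactly two cells, and %value_for is a name made up for illustration.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Assumed inline copy of the sample rows; the first <tr> is a guess,
# since the sample in the question starts mid-row.
my $html = <<'HTML';
<table>
<tr><td>Release Version:</td><td> 5134</td></tr>
<tr class="d0"><td>Executed By:</td><td>spoddar</td></tr>
<tr class="d1"><td> Duration:</td><td>0 Hrs 0 Mins 0 Secs </td></tr>
<tr class="d0"><td>#TCs Executed:</td><td>1</td></tr>
</table>
HTML

my $p = HTML::TokeParser->new(\$html)
    or die "Cannot create parser";

my %value_for;    # hypothetical name: label text => value text
while ($p->get_tag('tr')) {
    # Assumes two <td> cells per row: label first, value second
    $p->get_tag('td') or last;
    my $label = $p->get_trimmed_text('/td');
    $p->get_tag('td') or last;
    my $value = $p->get_trimmed_text('/td');
    $value_for{$label} = $value;
}

print $value_for{'Release Version:'}, "\n";
print $value_for{'Executed By:'}, "\n";
```

Because get_trimmed_text strips leading and trailing whitespace, the stray spaces in the sample (" 5134", " Duration:") do not affect the keys or values.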

Upvotes: 2

stevenl

Reputation: 6798

The most straightforward thing to do is to filter the tags you want and look through the text. The following approach assumes the format you have in the sample, with a 2-column table.

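sub get_value {
    my $key = shift;

    foreach my $tr ($tree->find('tr')) {
        my @td = $tr->find('td'); # search this row only, not the whole tree
        next unless @td >= 2;
        # as_trimmed_text drops the stray whitespace around labels and values
        return $td[1]->as_trimmed_text if $td[0]->as_trimmed_text eq $key;
    }
    return;
}

print get_value('Release Version:'), "\n";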
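
If several labels need to be looked up, it may be simpler to walk the table once and keep the pairs in a hash rather than rescanning the tree on every call. The following is a minimal, self-contained sketch of that idea, not a drop-in replacement: the HTML is inlined with new_from_content instead of read from sample.html, the opening <tr> of the first row is assumed, and %value_for is a hypothetical name.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Assumed inline copy of the sample rows; the first <tr> is a guess,
# since the sample in the question starts mid-row.
my $html = <<'HTML';
<table>
<tr><td>Release Version:</td><td> 5134</td></tr>
<tr class="d0"><td>Executed By:</td><td>spoddar</td></tr>
<tr class="d1"><td> Duration:</td><td>0 Hrs 0 Mins 0 Secs </td></tr>
<tr class="d0"><td>#TCs Executed:</td><td>1</td></tr>
</table>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

my %value_for;    # hypothetical name: label text => value text
foreach my $tr ($tree->find('tr')) {
    my @td = $tr->find('td');
    next unless @td >= 2;    # skip rows that are not label/value pairs
    my ($label, $value) = map { $_->as_trimmed_text } @td[0, 1];
    $value_for{$label} = $value;
}
$tree->delete;    # free the parse tree once the values are extracted

print $value_for{'Release Version:'}, "\n";
print $value_for{'Executed By:'}, "\n";
```

The trade-off is that a repeated label silently overwrites the earlier value, whereas get_value returns the first match.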

Upvotes: 3
