Reputation: 101

Extracting information from HTML file using perl script

This is my code to extract certain data under the header Item Drop%. I want to extract the 90.5% under that header. But i can only extract the whole column not just that value. any idea ?

#!/usr/bin/perl

use strict;
use warnings;

use HTML::TableExtract;
use LWP::Simple;

my $file = 'data.html';
unless ( -e $file ) {
    my $rc = getstore(
        'proj/Desktop/folder1/data.html',
        $file);
    die "Failed to download document\n" unless $rc == 200;
}



my $te = HTML::TableExtract->new( headers => qw(Item Drop%)]);

$te->parse_file($file);

my ($table) = $te->tables;

foreach my $ts (ts->tables) {
    print "Table (", join(',', $ts->coords), ");\n";
    foreach my $row ($ts->rows) {
        print join(',', @$row), "\n";
    }
}

My data.html is:


 ..
 ..
 ..
<table align = "center" class="" style= .......>
<tr>
<th rowspan="2">EM</th>
<th colspan="2"><a href= "proj/Desktop/folder1/data.html" class = ..../th>
<td> 90.5%</td>
</tr>
..
..
..
..
<tr>
<th rowspan="2">EM</th>
<th colspan="2"><a href= "proj/Desktop/folder1/data.html" class = ..../th>
<td> 40%</td>
</tr>

</table>

Upvotes: 1

Answers (1)

zdim

Reputation: 66883

Here are basics, with the given table fragment completed to a meaningful table.

use warnings;
use strict;
use feature 'say';

use HTML::TableExtract;
use Scalar::Util qw(looks_like_number);

my $filename = shift // die "Usage: $0 file\n";

my $te = HTML::TableExtract->new;
$te->parse_file($filename);    

my ($tbl) = $te->tables;  # one table in the sample file

my (@values1, @values2);  # for 90.5% and such, processing options

foreach my $row ($tbl->rows) { 
    # Tables often come with empty fields; keep them, for counting and such
    my @fields = map { defined($_) ? $_ : '--' } @$row;
    printf "%8s ", $_  for @fields;
    say ''; 

    # Criteria for how to identify the number aren't explained,
    # but may it be the fourth column in a row starting with 'EM'?
    if ($fields[0] =~ /^\s*EM\s*$/) {
        push @values1, $fields[3] =~ s/^\s*|\s*$//gr;  # see note in text
    }

    # Or is it simply the number ending with % sign?
    foreach my $fld (@fields) { 
        if ($fld =~ /\s*(.+)\s*%/ and looks_like_number($1)) { 
            push @values2, $1;
        }
    }
}
say "@values1";
say "@values2";

NOTE: The /r modifier in the regex was added in v5.14. If your Perl is older see footnote^†

Much of processing is shown for a demo. One needn't print those values (once you figure out where the items of interest are), and I wouldn't replace undef fields with --, done for clearer printout, but rather with '' (empty string). Also, we would need one criterion, not two.

Note that in the first case we keep the % sign, and use regex to clean up the spaces; in the second case the percentile sign is left out (and spaces end up cleaned out by regex naturally, in matching). These can both be adjusted to what you actually need of course.

Since neither the actual table nor data in it, nor the exact criteria, are given I can only offer hints and examples of code. With more detail this can be made more specific.

Note, when it comes to nailing down the last details, often by regex, things tend to become picky and sensitive to details; so careful.

The html file, completed off of the fragment in the question, used above:

<html>    
<style> th, td { padding: 10px } </style>  <!-- to better see it -->

<table align="center" rules="all">    
<tr>
    <th rowspan="2">EM</th>
    <th colspan="2"><a href="http://www.google.com">ggl</a></th>
    <td> 90.5%</td>
</tr>
<tr>
    <td>data</td> <td>more</td> <td>etc</td>
</tr>    

<tr>
<th rowspan="2">EM</th>
<th colspan="2"><a href="http://www.google.com">ggl</a></th>
<td> 0.0%</td>
</tr>
<tr> <td>data</td> <td>more</td> <td>etc</td> </tr>    
</table>    

</html>

^† The /r modifier on the substitution regex makes it return the changed string (and it leaves the original as it is). This is precisely what one wants in many situations, the one in the code here being a good example (we merely want to add the changed string to an array).

However, in Perls older than version 5.14, when this feature was introduced, one has to do otherwise: either create a new variable that has the change, if you need to keep the original unchanged, or change that original and then use it.

Since we are pruning extra spaces here it may well be best to simply change $fields[3] and then add it to the array. So instead of

if ($fields[0] =~ /^\s*EM\s*$/) {
    push @values1, $fields[3] =~ s/^\s*|\s*$//gr;
}

if ($fields[0] =~ /^\s*EM\s*$/) {
    $fields[3] =~ s/^\s*|\s*$//g;   # strip leading/trailing spaces
    push @values1, $fields[3];
}

Upvotes: 3

Extracting information from HTML file using perl script

Answers (1)

Related Questions