Ashkan Abedinkhan

Reputation: 35

How to read the values of a table in HTML file and Store them in Perl?

I have read many questions and answers, but I couldn't find a straight answer to my question. All the answers were either very general or different from what I want to do. I got as far as learning that I need to use HTML::TableExtract or HTML::TreeBuilder::XPath, but I couldn't really use them to store the values. I could only get the table row values and show them with Dumper.

Something like this:

foreach my $ts ($tree->table_states) {
    foreach my $row ($ts->rows) {
        push @fir, Dumper($row);    # stores a dumped string, not the cell values
    }
}
print @fir;

But this does not really do what I'm looking for. Here is the structure of the HTML table whose values I want to store:

<table><caption><b>Table 1 </b>bla bla bla</caption>
<tbody>
    <tr>
        <th ><p>Foo</p>
        </th>

        <td ><p>Bar</p>
        </td>

    </tr>

    <tr>
        <th ><p>Foo-1</p>
        </th>

        <td ><p>Bar-1</p>
        </td>

    </tr>

    <tr>
        <th ><p>Formula</p>
        </th>

        <td><p>Formula1-1</p>
            <p>Formula1-2</p>
            <p>Formula1-3</p>
            <p>Formula1-4</p>
            <p>Formula1-5</p>
        </td>

    </tr>

    <tr>
        <th><p>Foo-2</p>
        </th>

        <td ><p>Bar-2</p>
        </td>

    </tr>

    <tr>
        <th ><p>Foo-3</p>
        </th>

        <td ><p>Bar-3</p>
             <p>Bar-3-1</p>
        </td>

    </tr>

</tbody>

</table>

It would be convenient if I could store the values of each row together as a pair.

The expected output would be something like an array with the values (Foo, Bar, Foo-1, Bar-1, Formula, Formula1-1 Formula1-2 Formula1-3 Formula1-4 Formula1-5, ...). The important thing for me is to learn how to store the values of each tag and how to move around in the tag tree.

Upvotes: 1

Views: 201

Answers (1)

daxim

Reputation: 39158

Learn XPath and DOM manipulation.

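use strictures;
use HTML::TreeBuilder::XPath qw();

# Parse the HTML file into a DOM tree that can be queried with XPath.
my $dom = HTML::TreeBuilder::XPath->new;
$dom->parse_file('10280979.html');

# Hash slice: the text of each <th> becomes a key, and the <p> texts of the
# <td> at the same position become its value (an array reference).
my %extract;
@extract{$dom->findnodes_as_strings('//th')} =
    map {[$_->findvalues('p')]} $dom->findnodes('//td');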
__END__
# %extract = (
#     Foo     => [qw(Bar)],
#     'Foo-1' => [qw(Bar-1)],
#     'Foo-2' => [qw(Bar-2)],
#     'Foo-3' => [qw(Bar-3 Bar-3-1)],
#     Formula => [qw(Formula1-1 Formula1-2 Formula1-3 Formula1-4 Formula1-5)],
# )
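
The hash above matches th and td cells by position and does not keep the row order. If the order matters, or you want the flat list from the question, one possible variation is to walk the table row by row. This is only a sketch: the inlined HTML is a trimmed copy of the table in the question, and the names @pairs and @flat are made up for illustration.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Trimmed copy of the table from the question, inlined so the sketch
# does not depend on an external file (an assumption for illustration).
my $html = <<'HTML';
<table><caption><b>Table 1 </b>bla bla bla</caption>
<tbody>
    <tr><th><p>Foo</p></th><td><p>Bar</p></td></tr>
    <tr><th><p>Formula</p></th>
        <td><p>Formula1-1</p>
            <p>Formula1-2</p></td></tr>
    <tr><th><p>Foo-3</p></th><td><p>Bar-3</p><p>Bar-3-1</p></td></tr>
</tbody>
</table>
HTML

my $dom = HTML::TreeBuilder::XPath->new_from_content($html);

# One entry per row, in document order: [ header text, [ paragraph texts ] ].
my @pairs;
for my $row ($dom->findnodes('//table//tr')) {
    my $header = $row->findvalue('./th/p');     # XPath relative to this row
    my @values = $row->findvalues('./td/p');    # one string per <p> in the <td>
    push @pairs, [ $header, \@values ];
}

# Flat list like the one asked for: header, then its values joined by spaces.
my @flat = map { ($_->[0], join ' ', @{ $_->[1] }) } @pairs;
print join(' , ', @flat), "\n";

$dom->delete;    # free the parse tree once the values are copied out
```

Because each XPath expression is evaluated relative to its own tr element, a row with a missing td ends up with an empty value list instead of shifting every later pair by one.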

Upvotes: 3
