Reputation: 1621
I want to extract all the tables from an HTML file and print their contents as follows: each cell separated by \t, each row separated by \n, and each table separated by \n\n. The following is my script. When I changed it to call findvalues on tr, the whole tr was returned as a single element. I also tried other methods such as findnodes_as_strings($path). How can I modify it to produce the structure described above?
use strict;
use warnings;
use HTML::TreeBuilder::XPath;
my $tree= HTML::TreeBuilder::XPath->new;
$tree->parse_file( "html.html");
my @values=$tree->findvalues(q{//table//tr//td});
print $_, "\n" foreach(@values);
Upvotes: 3
Views: 1204
Reputation: 10666
You need to process each table separately, same for rows:
foreach my $table ( $tree->findnodes('//table') ) {
    foreach my $row ( $table->findnodes('.//tr') ) {
        my @cells = $row->findvalues('.//td');
        print join("\t", @cells), "\n";
    }
    print "\n";
}
Of course, this solution only works for simple tables. It does not account for colspan/rowspan attributes, th header cells, or tables nested inside other tables.
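As a rough sketch of how two of those caveats (th cells and nested tables) might be handled, the variant below keeps only the rows that belong directly to each table and reads both td and th children. The helper name tables_to_text is hypothetical, and the sketch assumes that flattening a cell with as_text is acceptable; colspan and rowspan are still ignored.

```perl
use strict;
use warnings;
use Scalar::Util qw(refaddr);
use HTML::TreeBuilder::XPath;

# Hypothetical helper (not part of the original answer): takes an HTML
# string and returns cells joined by \t, rows by \n, tables by a blank line.
sub tables_to_text {
    my ($html) = @_;

    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->parse($html);
    $tree->eof;

    my $out = '';
    foreach my $table ( $tree->findnodes('//table') ) {
        # Keep only rows whose nearest enclosing table is this one, so the
        # rows of a nested table are not repeated under its outer table.
        my @rows = grep {
            refaddr( $_->look_up( _tag => 'table' ) ) == refaddr($table)
        } $table->findnodes('.//tr');

        foreach my $row (@rows) {
            # Direct td/th children only; text children are plain strings,
            # hence the ref() check before calling tag().
            my @cells = map  { $_->as_text }
                        grep { ref $_ && $_->tag =~ /^t[dh]$/ }
                        $row->content_list;
            $out .= join( "\t", @cells ) . "\n";
        }
        $out .= "\n";
    }

    $tree->delete;    # release the parse tree
    return $out;
}
```

To use it, read the contents of html.html into a string and print the return value of tables_to_text. A nested table is still emitted as a table of its own because //table matches it too, and its text also appears inside the outer cell that contains it, since as_text flattens everything below that cell.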
Upvotes: 4