Reputation: 5856
I want to process several HTML pages that contain tables.
The pages:
Question: How can I find the correct table based on one of its cell values, using Web::Scraper, Scrappy, or another tool?
Example code:
#!/usr/bin/env perl
use 5.014;
use warnings;
use Web::Scraper;
use YAML;
my $html = do { local $/; <DATA> };
my $table = scraper {
    #the easy way - table with class, or id or any attribute
    #process 'table.xxx > tr', 'rows[]' => scraper {
    #unfortunately, the table hasn't class='xxx', so :(
    process 'NEED_HELP_HERE > tr', 'rows[]' => scraper {
        process 'th', 'header' => 'TEXT';
        process 'td', 'cols[]' => 'TEXT';
    };
};
my $result = $table->scrape( $html );
say Dump($result);
__DATA__
<html>
<head><title>title</title></head>
<body>
<table><tr><th class="inverted">header</th><td>value</td></tr></table>
<!-- several other tables follow here (their number varies) -->
<table> <!-- would be easy with some class="xxx" -->
<tr>
<th class="inverted">Content</th> <!-- Need this table - 1st cell == "Content" -->
<td class="inverted">col-1</td>
<td class="inverted">col-n</td>
</tr>
<tr>
<th>Date</th>
<td>2012</td>
<td>2001</td>
</tr>
<tr>
<th>Banana</th>
<td>val-1</td>
<td>val-n</td>
</tr>
</table>
</body>
</html>
Upvotes: 1
Views: 1339
Reputation: 39158
As usual, Web::Query wins for compactness. Unlike with Web::Scraper, you don't have to name the results, but if you want to, it takes just one extra line.
use Web::Query qw();
Web::Query->new_from_html($html)
    ->find('th:contains("Content")')
    ->parent->parent->find('tr')->map(sub {
        my (undef, $tr) = @_;
        +{ $tr->find('th')->text => [$tr->find('td')->text] }
    })
The expression returns:
[
    {Content => ['col-1', 'col-n']},
    {Date => [2012, 2001]},
    {Banana => ['val-1', 'val-n']}
]
Upvotes: 1
Reputation: 126732
You need to use an XPath expression to look at the text content of the nodes.
This should do the trick:
my $table = scraper {
    process '//table[tr[1]/th[1][normalize-space(text())="Content"]]/tr', 'rows[]' => scraper {
        process 'th', 'header' => 'TEXT';
        process 'td', 'cols[]' => 'TEXT';
    };
};
It may look complex, but it's manageable if you break it down. It selects all <tr> elements that are children of any <table> element beneath the root for which the first <th> element of the first <tr> element has a text node equal to "Content" once normalized (leading and trailing whitespace stripped, internal runs of whitespace collapsed to a single space).
Output:
---
rows:
  - cols:
      - col-1
      - col-n
    header: Content
  - cols:
      - 2012
      - 2001
    header: Date
  - cols:
      - val-1
      - val-n
    header: Banana
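If you want to see how the expression breaks down, you can try the pieces outside Web::Scraper. This is only a sketch: it assumes HTML::TreeBuilder::XPath is installed (Web::Scraper depends on it for XPath support), it uses a cut-down copy of the question's HTML, and the variable names are mine.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Cut-down copy of the question's markup (illustrative only).
my $html = <<'HTML';
<html><body>
<table><tr><th>header</th><td>value</td></tr></table>
<table>
<tr><th>Content</th><td>col-1</td><td>col-n</td></tr>
<tr><th>Date</th><td>2012</td><td>2001</td></tr>
</table>
</body></html>
HTML

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# The predicate in [...] keeps only tables whose first row's
# first <th> reads "Content" after whitespace normalization.
my $table_xpath = '//table[tr[1]/th[1][normalize-space(text())="Content"]]';
my @tables = $tree->findnodes($table_xpath);

# Appending /tr selects the rows of the matching table(s).
my @rows = $tree->findnodes("$table_xpath/tr");

# Text of the <th> cell in each selected row.
my @headers = map { $_->look_down(_tag => 'th')->as_trimmed_text } @rows;

printf "%d table(s), %d row(s): %s\n",
    scalar @tables, scalar @rows, join(', ', @headers);
```

Running the table part on its own first makes it easier to tell whether a failed match comes from the predicate or from the trailing /tr step.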
Upvotes: 4
Reputation: 670
HTML::TableExtract seems to be good for this problem.
Give this a try.
#!/usr/bin/perl
use strict;
use warnings;
use lib qw( ..);
use HTML::TableExtract;
use LWP::Simple;
my $te = HTML::TableExtract->new( headers => [qw(Content)] );
my $content = get("http://www.example.com");
$te->parse($content);
foreach my $ts ($te->tables) {
    print "Table (", join(',', $ts->coords), "):\n";
    foreach my $row ($ts->rows) {
        print join(',', @$row), "\n";
    }
}
If you change this line
my $te = HTML::TableExtract->new( headers => [qw(Content)] );
to
my $te = HTML::TableExtract->new();
it will return all of the tables, so you can experiment with that line if the code above doesn't give you exactly what you're looking for.
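One thing to watch, going by my reading of the HTML::TableExtract documentation: when headers is given, only the columns under the matched headers are returned, and the header row itself is dropped. Here is a sketch that keeps the whole table. It assumes the slice_columns and keep_headers options behave as documented, and it parses a cut-down copy of the question's HTML from a string instead of fetching a URL.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use HTML::TableExtract;

# Cut-down copy of the question's markup (illustrative only).
my $html = <<'HTML';
<html><body>
<table><tr><th>header</th><td>value</td></tr></table>
<table>
<tr><th>Content</th><td>col-1</td><td>col-n</td></tr>
<tr><th>Date</th><td>2012</td><td>2001</td></tr>
<tr><th>Banana</th><td>val-1</td><td>val-n</td></tr>
</table>
</body></html>
HTML

my $te = HTML::TableExtract->new(
    headers       => ['Content'],
    slice_columns => 0,   # keep every column, not only the one under "Content"
    keep_headers  => 1,   # keep the matched header row in the results
);
$te->parse($html);
$te->eof;

# Collect each row of each matching table as a comma-joined line.
my @lines;
for my $ts ($te->tables) {
    for my $row ($ts->rows) {
        push @lines, join ',', map { defined $_ ? $_ : '' } @$row;
    }
}
print "$_\n" for @lines;
```

The first table has no "Content" cell, so it should not show up in the results at all.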
Upvotes: 3