Reputation: 5856
I have an HTML page with this structure: a table with id="searchResult" whose td cells have no class.
I tried different XPath scrapers, like:
my $links = scraper {
process '//table[id="searchResult"]', "lines[]" => scraper {
process "//tr/td[2]/a", text => 'TEXT';
process "//tr/td[2]/a", link => '@href';
};
};
my $res = $links->scrape($html);
But it doesn't work, and $res is an empty {}.
In case anyone needs it, here is the full test code:
use 5.014;
use warnings;
use Web::Scraper;
use Data::Dumper;
my $links = scraper {
process '//table[id="searchResult"]', "lines[]" => scraper {
process "//tr/td[2]/a", text => 'TEXT';
process "//tr/td[2]/a", link => '@href';
};
};
my $html = do {local $/;<DATA>};
#say $html;
my $res = $links->scrape($html);
say Dumper $res;
__DATA__
<html>
<body>
<p>...</p>
<table id="searchResult">
<thead><th>x</th><th>x</th><th>x</th><th>x</th><th>x</th></thead>
<tr>
<td><a href="#11">cell11</a></td>
<td><a href="#12">cell12</a></td>
<td><a href="#13">cell13</a></td>
</tr>
<tr>
<td><a href="#21">cell21</a></td>
<td><a href="#22">cell22</a></td>
<td><a href="#23">cell23</a></td>
</tr>
<tr>
<td><a href="#31">cell31</a></td>
<td><a href="#32">cell32</a></td>
<td><a href="#33">cell33</a></td>
</tr>
</table>
</body>
</html>
Upvotes: 1
Views: 228
Reputation: 35198
My preferred scraper for this type of project is Mojo::DOM. For a helpful 8-minute introductory video, check out Mojocast Episode 5.
You could probably also use a pointer to a CSS Selector Reference.
The following performs the parsing you're trying to do with this module:
use strict;
use warnings;
use Mojo::DOM;
my $dom = Mojo::DOM->new(do {local $/; <DATA>});
for my $link ($dom->find('table[id=searchResult] > tr > td:nth-child(2) > a')->each) {
print $link->{href}, " - ", $link->text, "\n";
}
__DATA__
<html>
<body>
<p>...</p>
<table id="searchResult">
<thead><th>x</th><th>x</th><th>x</th><th>x</th><th>x</th></thead>
<tr>
<td><a href="#11">cell11</a></td>
<td><a href="#12">cell12</a></td>
<td><a href="#13">cell13</a></td>
</tr>
<tr>
<td><a href="#21">cell21</a></td>
<td><a href="#22">cell22</a></td>
<td><a href="#23">cell23</a></td>
</tr>
<tr>
<td><a href="#31">cell31</a></td>
<td><a href="#32">cell32</a></td>
<td><a href="#33">cell33</a></td>
</tr>
</table>
</body>
</html>
Outputs:
#12 - cell12
#22 - cell22
#32 - cell32
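If you would rather stay with Web::Scraper, the likely reason your expression matched nothing is the predicate: in XPath, attributes are addressed as @id, while a bare id refers to a child element named id. The following is a minimal, untested sketch of that fix. It also assumes you want one result per row, so the outer rule selects each tr that contains td cells rather than the table itself; the heredoc simply reuses a shortened copy of your test table.

```perl
use strict;
use warnings;
use Web::Scraper;

# Shortened copy of the table from the question.
my $html = <<'HTML';
<table id="searchResult">
<thead><th>x</th><th>x</th><th>x</th></thead>
<tr><td><a href="#11">cell11</a></td><td><a href="#12">cell12</a></td><td><a href="#13">cell13</a></td></tr>
<tr><td><a href="#21">cell21</a></td><td><a href="#22">cell22</a></td><td><a href="#23">cell23</a></td></tr>
<tr><td><a href="#31">cell31</a></td><td><a href="#32">cell32</a></td><td><a href="#33">cell33</a></td></tr>
</table>
HTML

my $links = scraper {
    # '@id' tests the id attribute; plain 'id' would look for an <id> child element.
    # 'tr[td]' keeps only rows that contain <td> cells, skipping any header row.
    process '//table[@id="searchResult"]//tr[td]', 'lines[]' => scraper {
        # Inside the nested scraper the expression is evaluated against each row.
        process '//td[2]/a', text => 'TEXT';
        process '//td[2]/a', link => '@href';
    };
};

my $res = $links->scrape($html);

# Print one "link - text" line per scraped row.
for my $line (@{ $res->{lines} || [] }) {
    print "$line->{link} - $line->{text}\n";
}
```

Note that scoping the outer rule to the table, as in your original, would give a single entry in lines even with @id in place, because the scalar keys text and link keep only the first match.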
Upvotes: 3