Reputation: 270
I'm trying to write a Perl script that opens an HTML file and extracts the text inside every <span class="postertrip"> tag.
Sample HTML:
<table>
<tbody>
<tr>
<td class="doubledash">>></td>
<td class="reply" id="reply2">
<a name="2"></a> <label><input type="checkbox" name="delete" value="1199313466,2" /> <span class="replytitle"></span> <span class="commentpostername"><a href="test">Test1</a></span><span class="postertrip"><a href="test">!AAAAAAAA</a></span> 08/01/03(Thu)02:06</label> <span class="reflink"> <a href="test">No.2</a> </span> <br /> <span class="filesize">File: <a target="_blank" href="test">1199326003295.jpg</a> -(<em>65843 B, 288x412</em>)</span> <span class="thumbnailmsg">Thumbnail displayed, click image for full size.</span><br /> <a target="_blank" test"> <img src="test" width="139" height="200" alt="65843" class="thumb" /></a>
<blockquote>
<p>Test message 1</p>
</blockquote>
</td>
</tr>
</tbody>
</table>
<table>
<tbody>
<tr>
<td class="doubledash">>></td>
<td class="reply" id="reply5">
<a name="5"></a> <label><input type="checkbox" name="delete" value="1199313466,5" /> <span class="replytitle"></span> <span class="commentpostername">Test2</span><span class="postertrip">!BBBBBBBB</span> 08/01/03(Thu)16:12</label> <span class="reflink"> <a href="test">No.5</a> </span>
<blockquote>
<p>Test message 2</p>
</blockquote>
</td>
</tr>
</tbody>
</table>
<table>
<tbody>
<tr>
<td class="doubledash">>></td>
<td class="reply" id="reply7">
<a name="7"></a> <label><input type="checkbox" name="delete" value="1199161229,7" /> <span class="replytitle"></span> <span class="commentpostername">Test3</span><span class="postertrip">!CCCCCCCC.</span> 08/01/01(Tue)17:53</label> <span class="reflink"> <a href="test">No.7</a> </span>
<blockquote>
<p>Test message 3</p>
</blockquote>
</td>
</tr>
</tbody>
</table>
Desired output:
!AAAAAAAA
!BBBBBBBB
!CCCCCCCC
Current script:
#!/usr/bin/env perl
use warnings;
use strict;
use 5.010;
use HTML::TreeBuilder;
open(my $html, "<", "temp.html")
    or die "Can't open";
my $tree = HTML::TreeBuilder->new();
$tree->parse_file($html);
foreach my $e ($tree->look_down('class', 'reply')) {
    my $e = $tree->look_down('class', 'postertrip');
    say $e->as_text;
}
Bad output of script:
!AAAAAAAA
!AAAAAAAA
!AAAAAAAA
Upvotes: 3
Views: 529
Reputation: 132858
I've never liked HTML::TreeBuilder. It's a bit of a complicated mess, and it hasn't been updated in three years. Using CSS selectors with Mojo::DOM is pretty easy though. Its find method does all the work that the various look_down calls do:
use v5.10;
use Mojo::DOM;
my $html = do { local $/; <DATA> };
my @values = Mojo::DOM->new( $html )
    ->find( 'td.reply span.postertrip' )
    ->map( 'all_text' )
    ->each;
say join "\n", @values;
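Note that in your HTML::TreeBuilder code, the inner look_down starts from $tree rather than from the reply element the loop found, so it returns the first postertrip span in the whole document on every iteration. You can fix that, but it takes extra work. The CSS selector takes care of that scoping for you.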
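If you want to keep each tripcode paired with its reply, the same scoping idea can be applied one reply at a time. This is only a sketch: the sample markup is a cut-down, hypothetical version of the question's HTML, and trips_by_reply is an illustrative helper name, not part of Mojo::DOM. It relies on at, which searches below the current element and returns undef when nothing matches.

```perl
use strict;
use warnings;
use v5.10;
use Mojo::DOM;

# Hypothetical two-reply sample; the second reply has no tripcode span.
my $sample = <<'HTML';
<table><tbody><tr>
<td class="reply" id="reply2"><span class="postertrip"><a href="test">!AAAAAAAA</a></span></td>
</tr></tbody></table>
<table><tbody><tr>
<td class="reply" id="reply9"><span class="commentpostername">Anon</span></td>
</tr></tbody></table>
HTML

# Illustrative helper: returns one [id, tripcode-or-undef] pair per reply cell.
sub trips_by_reply {
    my ($html) = @_;
    my @rows;
    for my $reply ( Mojo::DOM->new($html)->find('td.reply')->each ) {
        # at() only looks inside this reply and returns undef on no match
        my $trip = $reply->at('span.postertrip');
        push @rows, [ $reply->attr('id'), $trip ? $trip->all_text : undef ];
    }
    return @rows;
}

for my $row ( trips_by_reply($sample) ) {
    my ( $id, $trip ) = @$row;
    say "$id: ", $trip // '(no tripcode)';
}
```

Checking the result of at before calling all_text avoids dying on a reply that has no postertrip span.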
Upvotes: 5
Reputation: 2341
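In your foreach loop you have to look down from the element you found, not from $tree. Calling $tree->look_down in scalar context returns the first match in the whole document every time, which is why you get !AAAAAAAA three times. So the correct code is: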
foreach my $parent ($tree->look_down('class', 'reply')) {
    my $e = $parent->look_down('class', 'postertrip');
    say $e->as_text;
}
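For reference, here is a fuller sketch of that loop as a standalone script. It makes a few assumptions: the HTML is parsed from an inline string with new_from_content instead of being read from temp.html, the sample markup is a trimmed, hypothetical version of the question's, and reply_trips is an illustrative helper name. It also skips any reply that has no postertrip span, since look_down returns undef in scalar context when nothing matches.

```perl
use strict;
use warnings;
use 5.010;
use HTML::TreeBuilder;

# Illustrative helper: returns the tripcode text of each reply that has one.
sub reply_trips {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my @trips;
    foreach my $parent ( $tree->look_down( class => 'reply' ) ) {
        # Scalar context: first postertrip element at or below this reply
        my $e = $parent->look_down( class => 'postertrip' );
        next unless $e;    # this reply has no tripcode
        push @trips, $e->as_text;
    }
    $tree->delete;         # release the parse tree
    return @trips;
}

# Hypothetical sample: two replies with tripcodes, one without.
my $sample_html = <<'HTML';
<table><tbody><tr>
<td class="reply" id="reply2"><span class="postertrip"><a href="test">!AAAAAAAA</a></span></td>
</tr></tbody></table>
<table><tbody><tr>
<td class="reply" id="reply5"><span class="postertrip">!BBBBBBBB</span></td>
</tr></tbody></table>
<table><tbody><tr>
<td class="reply" id="reply9"><span class="commentpostername">Anon</span></td>
</tr></tbody></table>
HTML

say for reply_trips($sample_html);
```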
Upvotes: 5