Reputation: 1666
I am using the Perl module HTML::TreeBuilder::LibXML to parse the HTML below:
<h2><a name="abbey">Abbey</a></h2>
<p>text paragraph 1</p>
<p>text paragraph 2</p>
<p>text paragraph 3</p>
<h2><a name="abbess">Abbess</a></h2>
<p>text paragraph 4</p>
<p>text paragraph 5</p>
<h2><a name="abbot">Abbot</a></h2>
<p>text paragraph 6</p>
<p>text paragraph 7</p>
<p>text paragraph 8</p>
How can I get a list of the text of the p tags that follow each h2 heading? That is, I need an array of the text of all the p tags following each name, for example:
"Abbey" => (
text paragraph 1
text paragraph 2
text paragraph 3
)
"Abbess" => (
text paragraph 4
text paragraph 5
)
"Abbot" => (
text paragraph 6
text paragraph 7
text paragraph 8
)
So how do I check whether the node name is "h2" or "p" when I loop over the tree nodes, something like this:
foreach my $node ($tree->findnodes(...)){
if $node is h2 ....
if $node is p...
}
The goal is to build a hash mapping each link name to the contents of its paragraphs, from the HTML above.
Upvotes: 1
Views: 153
Reputation: 6626
I would simply use find to locate the "h2" nodes, and then right to get the "p" nodes that follow each one:
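use warnings;
use strict;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new_from_content( "your HTML" );
my %links;
for my $h2 ($tree->find('h2')) {
    # The heading text becomes the hash key
    my $link = $h2->as_trimmed_text;
    # Walk the right-hand siblings while they are <p> elements;
    # "ref" guards against a plain text sibling, which is not an object
    my $current = $h2->right;
    while (ref $current && $current->tag eq 'p') {
        push @{$links{$link}}, $current->as_trimmed_text;
        $current = $current->right;
    }
}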
(HTML::TreeBuilder can be replaced by HTML::TreeBuilder::LibXML; this code will still work.)
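If you would rather keep the single loop from the question, the node name can be checked with the tag method. The following is only a minimal sketch, assuming HTML::TreeBuilder::LibXML is installed and that an XPath union such as '//h2 | //p' returns the nodes in document order (which libxml2 normally does). The helper name group_paragraphs is hypothetical, not part of any module:

```perl
use strict;
use warnings;
use HTML::TreeBuilder::LibXML;

# Hypothetical helper: groups the text of each <p> under the most
# recent <h2> seen while walking the nodes in document order.
sub group_paragraphs {
    my ($html) = @_;

    my $tree = HTML::TreeBuilder::LibXML->new;
    $tree->parse($html);
    $tree->eof;

    my (%links, $heading);
    # The XPath union selects every h2 and every p element.
    for my $node ($tree->findnodes('//h2 | //p')) {
        my $tag = $node->tag;
        if ($tag eq 'h2') {
            # Start a new group keyed by the heading text.
            $heading = $node->as_trimmed_text;
            $links{$heading} = [];
        }
        elsif ($tag eq 'p' && defined $heading) {
            # Paragraphs before the first heading are skipped.
            push @{ $links{$heading} }, $node->as_trimmed_text;
        }
    }
    return \%links;
}
```

Calling group_paragraphs($html) on the HTML from the question should give a hash reference with the keys "Abbey", "Abbess" and "Abbot", each pointing to an array of paragraph texts. Unlike the sibling walk above, this version does not stop at the first non-p element after a heading, so it also collects paragraphs that are separated from their heading by other tags.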
Upvotes: 1
Reputation: 39158
use Web::Query::LibXML 'wq';
my $html = ...;
my (%r, $key);
wq($html)->filter(sub { 1 })->each(sub {
    if ($_->match('h2')) {
        $key = $_->text;
    } elsif ($_->match('p')) {
        push @{ $r{$key} }, $_->text;
    }
});
__END__
(
Abbess => ["text paragraph 4", "text paragraph 5"],
Abbey => ["text paragraph 1", "text paragraph 2", "text paragraph 3"],
Abbot => ["text paragraph 6", "text paragraph 7", "text paragraph 8"],
)
Upvotes: 0