daliaessam

Reputation: 1666

XPath: get element name

I am using the Perl module HTML::TreeBuilder::LibXML to parse the HTML below:

<h2><a name="abbey">Abbey</a></h2>

<p>text paragraph 1</p>

<p>text paragraph 2</p>

<p>text paragraph 3</p>

<h2><a name="abbess">Abbess</a></h2>

<p>text paragraph 4</p>

<p>text paragraph 5</p>

<h2><a name="abbot">Abbot</a></h2>

<p>text paragraph 6</p>

<p>text paragraph 7</p>

<p>text paragraph 8</p>

How can I get a list of the text of the p tags that follow each h2 heading? That is, for each heading name I need an array holding the text of all the p tags that come after it, for example:

"Abbey" => (
text paragraph 1
text paragraph 2
text paragraph 3
)

"Abbess" => (
text paragraph 4
text paragraph 5
)

"Abbot" => (
text paragraph 6
text paragraph 7
text paragraph 8
)

So how do I check whether a node's name is "h2" or "p" when I loop over the tree nodes, something like this:

foreach my $node ($tree->findnodes(...)) {
    # if $node is an h2 ...
    # if $node is a p ...
}

The goal is to build a hash that maps each link name to the contents of its paragraphs, using the HTML above.

Upvotes: 1

Views: 153

Answers (2)

Dada

Reputation: 6626

I would simply use find to locate the "h2" nodes, and then right to walk over the "p" siblings that follow each one:

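use warnings;
use strict;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new_from_content( "your HTML" );

my %links;
for my $h2 ($tree->find('h2')) {
    # The heading text becomes the hash key
    my $link = $h2->as_trimmed_text;
    # Walk the right-hand siblings until something other than a <p> appears.
    # The ref check guards against plain-text siblings, which HTML::Element
    # returns as strings rather than objects.
    my $current = $h2->right;
    while (ref $current && $current->tag eq 'p') {
        push @{$links{$link}}, $current->as_trimmed_text;
        $current = $current->right;
    }
}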

(HTML::TreeBuilder can be replaced by HTML::TreeBuilder::LibXML; this code will still work.)
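
For comparison, the same sibling walk can be sketched with XML::LibXML directly, where nodeName is what answers the "is this node an h2 or a p" part of the question. This is only a sketch under a few assumptions: the sample HTML is shortened from the question, the variable names are made up for illustration, and each heading's paragraphs are assumed to be its direct following siblings.

```perl
use strict;
use warnings;
use XML::LibXML;

# Shortened sample of the HTML from the question (an assumption for this sketch)
my $html = <<'HTML';
<h2><a name="abbey">Abbey</a></h2>
<p>text paragraph 1</p>
<p>text paragraph 2</p>
<h2><a name="abbess">Abbess</a></h2>
<p>text paragraph 4</p>
HTML

# recover => 2 lets libxml2 parse real-world HTML without printing warnings
my $doc = XML::LibXML->load_html(string => $html, recover => 2);

my %links;
for my $h2 ($doc->findnodes('//h2')) {
    my $name = $h2->textContent;
    # nextNonBlankSibling skips the whitespace-only text nodes between tags;
    # the walk stops at the first sibling whose nodeName is not 'p'
    my $node = $h2->nextNonBlankSibling;
    while ($node && $node->nodeName eq 'p') {
        push @{ $links{$name} }, $node->textContent;
        $node = $node->nextNonBlankSibling;
    }
}

for my $name (sort keys %links) {
    print "$name => (", join(', ', @{ $links{$name} }), ")\n";
}
```

Stopping at the first non-p sibling means a paragraph separated from its heading by some other element (a div or a list, say) would be missed, so the loop condition would need loosening for less regular pages.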

Upvotes: 1

daxim

Reputation: 39158

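use Web::Query::LibXML 'wq';
my $html = ...;
my (%r, $key);
wq($html)->filter(sub { 1 })->each(sub {
    if ($_->match('h2')) {
        # Remember the latest heading text as the current key
        $key = $_->text;
    } elsif ($_->match('p')) {
        # File each paragraph under the most recent heading
        push @{ $r{$key} }, $_->text;
    }
});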
__END__
(
    Abbess => ["text paragraph 4", "text paragraph 5"],
    Abbey  => ["text paragraph 1", "text paragraph 2", "text paragraph 3"],
    Abbot  => ["text paragraph 6", "text paragraph 7", "text paragraph 8"],
)
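
The same single-pass idea can also be written in the shape of the loop from the question, using HTML::TreeBuilder::LibXML itself. A minimal sketch, assuming a shortened copy of the question's HTML and illustrative variable names: the XPath union selects h2 and p elements together, the nodes come back in document order, and tag reports each element's name.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::LibXML;

# Shortened sample of the HTML from the question (an assumption for this sketch)
my $html = <<'HTML';
<h2><a name="abbey">Abbey</a></h2>
<p>text paragraph 1</p>
<p>text paragraph 2</p>
<h2><a name="abbess">Abbess</a></h2>
<p>text paragraph 4</p>
HTML

my $tree = HTML::TreeBuilder::LibXML->new;
$tree->parse($html);
$tree->eof;

my (%links, $heading);
# '//h2 | //p' matches both element types in one query
for my $node ($tree->findnodes('//h2 | //p')) {
    if ($node->tag eq 'h2') {
        # A new heading starts a new group
        $heading = $node->as_text;
    }
    elsif ($node->tag eq 'p' && defined $heading) {
        # Paragraphs seen before any heading are skipped
        push @{ $links{$heading} }, $node->as_text;
    }
}

for my $name (sort keys %links) {
    print "$name => (", join(', ', @{ $links{$name} }), ")\n";
}
```

Note that '//p' matches every p in the document, including ones nested deeper than the headings, so on a real page the XPath may need narrowing to the container that holds the entries.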

Upvotes: 0
