Xpath get element name

Question

I am using the Perl module HTML::TreeBuilder::LibXML to parse the HTML page below:

<h2>Abbey</h2>
<p>text paragraph 1</p>
<p>text paragraph 2</p>
<p>text paragraph 3</p>

<h2>Abbess</h2>
<p>text paragraph 4</p>
<p>text paragraph 5</p>

<h2>Abbot</h2>
<p>text paragraph 6</p>
<p>text paragraph 7</p>
<p>text paragraph 8</p>

How can I get the text of the p tags that follow each h2 heading? That is, for each heading I need an array holding the text of all the p tags that follow it, for example:

"Abbey" => (
text paragraph 1
text paragraph 2
text paragraph 3
)

"Abbess" => (
text paragraph 4
text paragraph 5
)

"Abbot" => (
text paragraph 6
text paragraph 7
text paragraph 8
)

So how do I check whether a node's name is "h2" or "p" when I loop over the tree nodes, something like this:

foreach my $node ($tree->findnodes(...)){
 if $node is h2 ....
 if $node is p...
}

The goal is to build a hash that maps each link to the contents of its paragraphs, using the HTML above.

Dada · Accepted Answer

I would simply use find to locate the "h2" nodes, and then call right repeatedly to walk over the "p" siblings that follow each one:

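use warnings;
use strict;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new_from_content( "your HTML" );

my %links;
for my $h2 ($tree->find('h2')) {
    # The heading text becomes the hash key.
    my $link = $h2->as_trimmed_text;
    # Walk the siblings to the right of the heading while they are <p> elements.
    my $current = $h2->right;
    # ref() guards against plain-text siblings, which are strings rather than nodes.
    while (ref $current && $current->tag eq 'p') {
        push @{$links{$link}}, $current->as_trimmed_text;
        $current = $current->right;
    }
}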

(HTML::TreeBuilder can be replaced by HTML::TreeBuilder::LibXML; this code will still work.)
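For the loop asked about in the question, where each node is tested for being an h2 or a p, a minimal sketch with HTML::TreeBuilder::LibXML might look like the following. It assumes the headings are plain h2 elements and the paragraphs plain p elements; the sample markup and the group_paragraphs helper are made up for illustration. The tag method returns the element name, and the single XPath step selects both kinds of element in document order.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::LibXML;

# Assumed markup: a shortened version of the page from the question,
# with each heading in an <h2> and each paragraph in a <p>.
my $html = <<'HTML';
<html><body>
<h2>Abbey</h2>
<p>text paragraph 1</p>
<p>text paragraph 2</p>
<p>text paragraph 3</p>
<h2>Abbess</h2>
<p>text paragraph 4</p>
<p>text paragraph 5</p>
</body></html>
HTML

# Hypothetical helper: returns a hash reference mapping each h2 text
# to an array of the p texts that come after it.
sub group_paragraphs {
    my ($content) = @_;

    my $tree = HTML::TreeBuilder::LibXML->new;
    $tree->parse($content);
    $tree->eof;

    my (%links, $heading);
    # One XPath step matches h2 and p elements together, in document order.
    for my $node ($tree->findnodes('//*[self::h2 or self::p]')) {
        my $tag = $node->tag;    # element name, e.g. "h2" or "p"
        if ($tag eq 'h2') {
            $heading = $node->as_trimmed_text;
            $links{$heading} = [];
        }
        elsif ($tag eq 'p' && defined $heading) {
            # Paragraphs before the first heading are skipped.
            push @{ $links{$heading} }, $node->as_trimmed_text;
        }
    }
    return \%links;
}

my $links = group_paragraphs($html);
for my $heading (sort keys %$links) {
    print "$heading => (", join(', ', @{ $links->{$heading} }), ")\n";
}
```

Unlike the sibling walk above, this variant assigns every p to the most recent h2 even if other elements sit between them, so which one fits depends on how regular the page is.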
