Neon Flash

Reputation: 3233

Perl HTML::TreeBuilder XPath: table tags with no ID/Name

I want to extract some text from a specific table cell in an HTML page.

The problem is that this cell sits inside a table tag which has no ID/Name.

I am using HTML::TreeBuilder::XPath to extract the value with XPath expressions.

Here is what the HTML content looks like:

<table border="0">
<tr>
<td>Some Text</td>
<td>The Text I want comes here</td>
</tr>

This is what my XPath expression looks like:

@nodes=$tree->findnodes(q{//table[8]/tr/td[2]/text()});
print $_->string_value."\n" foreach(@nodes); # corrected, thanks mirod.

It does not print anything.

I used table[8] above because this is the eighth table tag in the HTML page (assuming the index starts from 1).

I used td[2] because I want the innerHTML of the second td tag.

Thanks.

Upvotes: 2

Views: 1091

Answers (2)

gangabass

Reputation: 10666

mirod's approach should work for you.

However, I recommend using findvalues instead of findnodes if you only need the text content.

Try running this code and post the output:

my @values=$tree->findvalues(q{//table[8]//tr[1]//td});
print $_, "\n" foreach(@values);
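If this still prints nothing, one possible reason (an assumption, since the full page is not shown) is how XPath counts positions: //table[8] matches a table that is the eighth table child of its own parent, not the eighth table in the whole document. The sketch below uses a made-up page with three tables in separate divs to show the difference, then picks a table by document position by indexing the list that findnodes returns. The sample HTML and variable names are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::TreeBuilder::XPath;

# Hypothetical page: three tables, each the only table inside its own <div>.
my $html = <<'HTML';
<html><body>
<div><table><tr><td>a1</td><td>a2</td></tr></table></div>
<div><table><tr><td>b1</td><td>b2</td></tr></table></div>
<div><table><tr><td>c1</td><td>c2</td></tr></table></div>
</body></html>
HTML

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# Asks for a table that is the third table child of its parent.
# No parent here has more than one table child.
my @by_sibling = $tree->findvalues(q{//table[3]/tr/td[2]});

# Collect every table, then pick the third one in Perl (0-based index).
my @tables = $tree->findnodes(q{//table});
my $value  = $tables[2]->findvalue(q{./tr/td[2]});

print "sibling position: ", scalar(@by_sibling), " match(es)\n";
print "third table in the list: $value\n";

$tree->delete;
```

In plain XPath 1.0 the same idea is usually written as (//table)[8] or /descendant::table[8]; I have not checked either form against your page.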

Upvotes: 1

mirod

Reputation: 16161

What happens if you remove the text() at the end of the XPath query? I would think that calling string_value on the td itself would be enough.

Also, method calls are not interpolated in strings, so you need to write print $_->string_value, "\n".
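To make the interpolation point concrete, here is a minimal sketch that needs no CPAN modules. Demo::Node is a hypothetical stand-in class for a tree node; only its string_value method matters.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for a node object; not part of any real module.
package Demo::Node;
sub new          { my ($class, $text) = @_; return bless { text => $text }, $class }
sub string_value { return $_[0]{text} }

package main;

my $node = Demo::Node->new('The Text I want');

# Inside double quotes only the variable is interpolated: the object
# stringifies to something like Demo::Node=HASH(0x...) and the
# "->string_value" part stays as literal text.
my $wrong = "$node->string_value\n";

# Calling the method outside the string gives the intended text.
my $right = $node->string_value . "\n";

# The @{[ ... ]} idiom works when the call has to sit inside a string.
my $inline = "@{[ $node->string_value ]}\n";

print $wrong, $right, $inline;
```

The first line printed is the stringified reference followed by the literal text ->string_value, which is why the original print statement did not show the cell text.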

Calling string_value gives you the text of the content, not the markup. For the markup you would need to use as_HTML and strip the outer tags (there is no method in HTML::Element that gives you the inner HTML):

#!/usr/bin/perl

use strict;
use warnings;

use HTML::TreeBuilder::XPath;

my $tree= HTML::TreeBuilder::XPath->new_from_content( <DATA>);

my @nodes=$tree->findnodes(q{//table[1]/tr/td[2]});
print $_->string_value, "\n" foreach(@nodes); # text
print $_->as_HTML, "\n" foreach(@nodes);      # outerHTML



__DATA__
<html>
<body>
<table border="0">
<tr>
<td>Some Text</td>
<td>The Text I want comes here with <b>nested</b> content</td>
</tr>
</table>
</body>
</html>
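Instead of stripping the outer tags from the as_HTML string, one way to get something close to innerHTML is to serialize the children of the cell and join them. This is only a sketch: inner_html is a hypothetical helper name, not an HTML::Element method, and text children are returned as stored, so they are not entity-escaped again.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::TreeBuilder::XPath;

# Return the markup between an element's start and end tags.
# inner_html is a hypothetical helper, not part of HTML::Element.
sub inner_html {
    my ($element) = @_;
    return join '', map {
        ref $_ ? $_->as_HTML    # child element: serialized with its own tags
               : $_             # text child: plain string, returned as stored
    } $element->content_list;
}

my $tree = HTML::TreeBuilder::XPath->new_from_content(
    '<table><tr><td>Some Text</td><td>Text with <b>nested</b> content</td></tr></table>'
);

my ($cell) = $tree->findnodes(q{//table[1]/tr/td[2]});
my $inner  = inner_html($cell);
print "$inner\n";

$tree->delete;
```

Depending on the HTML::Element version, as_HTML may add a newline after each serialized child, so trim the result if exact whitespace matters.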

Upvotes: 4
