Reputation: 3233
I want to extract some text from a specific table cell in an HTML page.
The problem is that this cell sits inside a table tag that has no ID or name.
I am using HTML::TreeBuilder::XPath to extract the value with XPath expressions.
Here is what the HTML content looks like:
<table border="0">
<tr>
<td>Some Text</td>
<td>The Text I want comes here</td>
</tr>
</table>
This is what my XPath expression looks like:
@nodes=$tree->findnodes(q{//table[8]/tr/td[2]/text()});
print $_->string_value."\n" foreach(@nodes); # corrected, thanks mirod.
It does not print anything.
I used table[8] because this is the eighth table tag in the HTML page (assuming the index starts from 1).
I used td[2] because I want the innerHTML of the second td element.
Thanks.
Upvotes: 2
Views: 1091
Reputation: 10666
mirod's approach should work for you. However, I recommend using findvalues instead of findnodes if you only need the text content.
Try running this code and post the output:
my @values=$tree->findvalues(q{//table[8]//tr[1]//td});
print $_, "\n" foreach(@values);
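One thing worth checking, offered as a guess about the page layout rather than a diagnosis: in XPath, //table[8] selects a table that is the eighth table child of its own parent, not the eighth table in the whole document. If the tables live in different containers, the query matches nothing, and (//table)[8] is the form that counts in document order. The sketch below shows the difference with two tables instead of eight. The sample markup is made up for illustration, and it assumes the XPath engine behind HTML::TreeBuilder::XPath accepts a parenthesised expression followed by a predicate.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Made-up sample page: two tables, each inside its own div.
my $html = <<'HTML';
<html><body>
<div><table><tr><td>first</td></tr></table></div>
<div><table><tr><td>second</td></tr></table></div>
</body></html>
HTML

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# Position is counted among sibling tables under the same parent.
my @per_parent = $tree->findvalues(q{//table[2]/tr/td});

# Position is counted over all tables in document order.
my @in_document = $tree->findvalues(q{(//table)[2]/tr/td});

print "per parent:  ", scalar(@per_parent), " match(es)\n";
print "in document: @in_document\n";

$tree->delete;
```

If the second form finds the cell and the first does not, the tables are not siblings, and the index in the original query needs the parentheses.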
Upvotes: 1
Reputation: 16161
What happens if you remove the text() at the end of the XPath query? I would think that calling string_value on the td itself would be enough.
Also, method calls are not interpolated in strings, so you need to write print $_->string_value, "\n".
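To illustrate the interpolation point, here is a small sketch that needs no HTML parsing at all. The Node package is a hypothetical stand-in for a node object and exists only for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for a node object with a string_value method.
{
    package Node;
    sub new          { my ($class, $text) = @_; return bless { text => $text }, $class }
    sub string_value { return $_[0]{text} }
}

my $node = Node->new('The Text I want');

# Only $node is interpolated; "->string_value" stays literal text.
my $wrong = "$node->string_value";

# Call the method outside the string and concatenate.
my $right = $node->string_value . "\n";

# Or force evaluation inside the string with an anonymous array.
my $inline = "@{[ $node->string_value ]}\n";

print "wrong:  $wrong\n";
print "right:  $right";
print "inline: $inline";
```

The first form prints the stringified object reference followed by the literal method name, which is why the output looks like nothing useful was extracted.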
string_value gives you the text of the content, not the markup. For the markup you would need to use as_HTML and strip the outer tags (there is no method in HTML::Element that gives you the inner HTML):
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;
my $tree= HTML::TreeBuilder::XPath->new_from_content( <DATA>);
my @nodes=$tree->findnodes(q{//table[1]/tr/td[2]});
print $_->string_value, "\n" foreach(@nodes); # text
print $_->as_HTML, "\n" foreach(@nodes); # outerHTML
__DATA__
<html>
<body>
<table border="0">
<tr>
<td>Some Text</td>
<td>The Text I want comes here with <b>nested</b> content</td>
</tr>
</table>
</body>
</html>
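Instead of stripping the outer tags from the as_HTML output, the inner markup can be approximated by serialising the children of the cell one by one. This is a minimal sketch, not a library feature: inner_html is a hypothetical helper, and it assumes content_list returns text children as plain strings and element children as objects that support as_HTML.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical helper: join the serialised children of an element.
# Text children are returned as-is, so they are not re-escaped here.
sub inner_html {
    my ($elt) = @_;
    return join '', map { ref $_ ? $_->as_HTML : $_ } $elt->content_list;
}

my $tree = HTML::TreeBuilder::XPath->new_from_content(
    '<table><tr><td>Some Text</td>'
  . '<td>Text with <b>nested</b> content</td></tr></table>'
);

# Take the first match of the second cell.
my ($cell) = $tree->findnodes(q{//table[1]/tr/td[2]});
my $inner = inner_html($cell);
print $inner, "\n";

$tree->delete;
```

Exact whitespace in the result may vary with the installed HTML::Element version, so compare it loosely if you test against it.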
Upvotes: 4