DVK

Reputation: 129423

Does HTML::TreeBuilder somehow scrunch together all the table elements that are nested under BODY tag?

I was trying to parse some webpage's content using HTML::TreeBuilder and then do a manual XPath-like walk.

But I got something really weird.

This is the XPath produced from the web page by Chrome's Developer Tools:

/html/body/table/tbody/tr/td[1]/table[3]/tbody/tr[1]/td[2]/table[1]/tbody/tr[1]/td[2]/table[9]

That last inner table #9 is what I need - more specifically, a cell that has "click to view" text in it.

Here's the developer tools screenshot - notice that BODY tag only has one table under it:

[Screenshot: Chrome Developer Tools showing a single table under the BODY tag]

And if you drill down into that XPath you will see the element I seek (notice it's really a table nested within a table within a table; I included the TD element I seek):

[Screenshot: Developer Tools drilled down to the nested table containing the target TD]

However, this is what HTML::TreeBuilder produced instead (basically, a <body> tag with 22 tags directly under it, most of which are <table> tags):

  DB<16>  x $tree->tag
0  'body'

  DB<17>  x map {$_->tag} $tree->content_list
0  'table'
1  'table'
2  'table'
3  'table'
4  'table'
5  'table'
6  'table'
7  'table'
8  'table'
9  'table'
10  'table'
11  'table'
12  'table'
13  'table'
14  'table'
15  'table'
16  'table'
17  'table'
18  'table'
19  'script'
20  'table'
21  'table'

And as you can see, the 9th table directly under the BODY tag (index 8) contains the element I want:

  DB<37> foreach my $c (0 .. $tree->content_list-1) { 
           if (($tree->content_list)[$c]->as_HTML =~ /click to view/)
              {print $c+1}}
9
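
For anyone who wants to reproduce the check outside the debugger, here is a rough standalone sketch of the same scan. The inline HTML and the helper name matching_children are made up for illustration and merely stand in for the real page; the ref test is there because content_list can also return plain text strings, which have no as_HTML method.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Return the 1-based positions of the direct children of $elem whose
# serialized HTML matches $re. Plain-text children (non-references)
# are skipped, since they have no as_HTML method.
sub matching_children {
    my ($elem, $re) = @_;
    my @kids = $elem->content_list;
    return map  { $_ + 1 }
           grep { ref $kids[$_] && $kids[$_]->as_HTML =~ $re }
           0 .. $#kids;
}

# Hypothetical stand-in for the real page.
my $html = <<'HTML';
<html><body>
<table><tr><td>first</td></tr></table>
<table><tr><td>click to view</td></tr></table>
</body></html>
HTML

my $root = HTML::TreeBuilder->new_from_content($html);
my $body = $root->find_by_tag_name('body');
my @hits = matching_children($body, qr/click to view/);
print "matching child positions: @hits\n";
$root->delete;
```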

Upvotes: 0

Views: 386

Answers (1)

Borodin

Reputation: 126722

It's most likely that the page you're processing contains invalid HTML. In that situation it's open season on how that content should actually be rendered, and different software will make different choices.

I'm afraid there isn't much you can do about it apart from either processing the HTML without the help of a parser, or perhaps finding the error and fixing it before you put it through HTML::TreeBuilder. Neither of these is a very pleasant prospect.
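
One partial workaround, if all you need is the cell itself rather than a faithful tree, is to search by content instead of by position, so the result doesn't depend on where the parser decided to hang each table. The following is only a sketch: the inline HTML is a made-up stand-in for the real page, and it assumes the phrase "click to view" is enough to identify the cell.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical stand-in for the real page: a table nested in a table.
my $html = <<'HTML';
<html><body>
<table><tr><td>
  <table><tr><td>click to view</td></tr></table>
</td></tr></table>
</body></html>
HTML

my $root = HTML::TreeBuilder->new_from_content($html);

# Every <td> whose text contains the phrase. Outer cells match too,
# because as_text includes the text of nested elements.
my @cells = $root->look_down(
    _tag => 'td',
    sub { $_[0]->as_text =~ /click to view/ },
);

# Keep only the innermost cells: those with no table nested inside.
# In scalar context look_down returns the first match or undef.
my @inner = grep { !$_->look_down(_tag => 'table') } @cells;

for my $cell (@inner) {
    # lineage_tag_names lists ancestors from the parent up to the root.
    my @path = reverse $cell->lineage_tag_names;
    print join('/', @path, $cell->tag), "\n";
}
$root->delete;
```

Printing the lineage this way also shows where HTML::TreeBuilder actually placed the cell, which can help when comparing against the path the browser reports.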

Upvotes: 0
