Reputation: 15712
I need to parse a website that has a lot of nested <div>s all over. I tried XML::Simple to get a nice tree structure, but the parse fails every time because there seem to be two or three unclosed <p> tags somewhere. I tried HTML::Parser, but it only lets me define handler functions that give me the right tags, not their nested elements.
Is there any way to get XML::Simple to accept non-valid XML, or HTML::Parser to give me a handy tree structure?
Upvotes: 0
Views: 828
Reputation: 386621
But is it valid HTML? If so, XML::LibXML will do a marvelous job if you use the HTML parsing functions. It is lightning fast and provides a great interface. It should even be able to handle some bad HTML using the recover option.
Alternatively, HTML::Parser (often used via HTML::TreeBuilder or HTML::TreeBuilder::XPath) is renowned for handling bad HTML. It won't be as fast, though.
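As a rough sketch of the HTML::TreeBuilder::XPath route, something like the following should work. The sample markup and the "inner" class name are made up for illustration; only the module calls (new_from_content, findnodes, as_text) come from the library.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical input: nested <div>s with <p> tags that are never closed.
my $html = <<'HTML';
<html><body>
<div id="outer">
  <div class="inner"><p>first paragraph</div>
  <div class="inner"><p>second paragraph</div>
</div>
</body></html>
HTML

# Build a tree; the underlying HTML::Parser tolerates the unclosed <p> tags.
my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# Select the nested <div> elements with an XPath expression.
my @inner = $tree->findnodes('//div[@class="inner"]');

for my $div (@inner) {
    print $div->as_text, "\n";
}
```

The XPath interface is mainly a convenience here: the same tree can also be walked with the plain HTML::Element methods.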
Upvotes: 3
Reputation: 8611
An alternative to something based on HTML::TreeBuilder is XML::LibXML->load_html(...).
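For example, a minimal sketch of load_html on markup with unclosed <p> tags might look like this. The sample HTML and the "inner" class are assumptions for illustration; recover, suppress_errors and suppress_warnings are XML::LibXML parser options.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical input: nested <div>s with <p> tags that are never closed.
my $html = <<'HTML';
<html><body>
<div id="outer">
  <div class="inner"><p>first paragraph</div>
  <div class="inner"><p>second paragraph</div>
</div>
</body></html>
HTML

# Parse as HTML rather than XML, and keep going when the markup is broken.
my $doc = XML::LibXML->load_html(
    string            => $html,
    recover           => 1,   # try to recover from parse errors
    suppress_errors   => 1,   # keep libxml2 quiet about the bad markup
    suppress_warnings => 1,
);

# The result is a normal DOM, so XPath queries work on it.
my @inner = $doc->findnodes('//div[@class="inner"]');

for my $div (@inner) {
    print $div->nodeName, ": ", $div->textContent, "\n";
}
```

How much damage the recover option can paper over depends on the underlying libxml2, so it is worth checking the resulting tree against a real page.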
Upvotes: 6
Reputation: 9697
HTML::TreeBuilder builds nice trees and gives you tons of handy methods to traverse them.
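A short sketch of what that traversal can look like, using look_down to find elements and then descend into them. The sample markup and the "inner" class name are invented for illustration.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical input: nested <div>s with <p> tags that are never closed.
my $html = <<'HTML';
<html><body>
<div id="outer">
  <div class="inner"><p>first paragraph</div>
  <div class="inner"><p>second paragraph</div>
</div>
</body></html>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# look_down searches the subtree for elements matching the criteria.
my @inner = $tree->look_down(_tag => 'div', class => 'inner');

# Collect the trimmed text of each <p> nested inside those <div>s.
my @texts;
for my $div (@inner) {
    for my $p ($div->look_down(_tag => 'p')) {
        push @texts, $p->as_trimmed_text;
    }
}

print "$_\n" for @texts;

# Release the tree explicitly when done with it.
$tree->delete;
```

Each element returned by look_down is itself an HTML::Element, so the same methods (look_down, content_list, parent, and so on) apply at every level of the nesting.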
Upvotes: 6