Reputation: 15712

Parsing HTML which is not valid XML

I need to parse a website that has a lot of nested <div>s all over. I tried XML::Simple to get a nice tree structure, but the parse fails every time because there seem to be two or three unclosed <p> tags somewhere. I also tried HTML::Parser, but it only lets me define handler functions that are called for the matching tags; it does not give me their nested elements.

Is there any way to get XML::Simple to accept non-valid XML, or to get HTML::Parser to give me a handy tree structure?

Upvotes: 0

Answers (3)

ikegami

Reputation: 386621

But is it valid HTML? If so, XML::LibXML will do a marvelous job if you use the HTML parsing functions. It is lightning fast and provides a great interface. It should even be able to handle some bad HTML using the recover option.

Alternatively, HTML::Parser (often used via HTML::TreeBuilder or HTML::TreeBuilder::XPath) is renowned for handling bad HTML. It won't be as fast, though.
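
A minimal sketch of the HTML::TreeBuilder::XPath route, assuming the module is installed from CPAN. The sample markup and variable names are made up for illustration; the point is only that unclosed <p> tags do not stop the parse and that the result can be queried with XPath.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Made-up sample markup: both <p> elements are left unclosed.
my $html = <<'HTML';
<html><body>
<div id="outer">
  <p>first paragraph
  <div class="inner"><p>second paragraph</div>
</div>
</body></html>
HTML

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($html);
$tree->eof;    # signal end of input so the tree is finalized

# XPath queries run against the tree the parser built.
my @divs = $tree->findnodes('//div');
for my $div (@divs) {
    my $label = $div->attr('id') // $div->attr('class') // '(none)';
    print "div: $label\n";
}

# findvalue returns the text content of the matched node(s) as a string.
my $inner_text = $tree->findvalue('//div[@class="inner"]');
print "inner div text: $inner_text\n";
```

The nodes returned by findnodes are HTML::Element objects, so the usual traversal methods are available on them as well.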

Upvotes: 3

reinierpost

Reputation: 8611

An alternative to something based on HTML::TreeBuilder is XML::LibXML->load_html(...).
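
As a rough illustration, assuming XML::LibXML is installed, the call could look like the sketch below. The sample markup is made up, and recover => 2 is used on the assumption that the page may contain errors you want the parser to recover from silently.

```perl
use strict;
use warnings;
use XML::LibXML;

# Made-up sample markup: both <p> elements are left unclosed.
my $html = <<'HTML';
<html><body>
<div id="outer">
  <p>first paragraph
  <div class="inner"><p>second paragraph</div>
</div>
</body></html>
HTML

# load_html uses libxml2's HTML parser rather than the strict XML parser.
# recover => 2 asks it to recover from errors without printing warnings.
my $doc = XML::LibXML->load_html(
    string  => $html,
    recover => 2,
);

# The result is an ordinary DOM document, so XPath works on it.
my @divs = $doc->findnodes('//div');
for my $div (@divs) {
    my $label = $div->getAttribute('id') // $div->getAttribute('class') // '(none)';
    print "div: $label\n";
}

my ($inner) = $doc->findnodes('//div[@class="inner"]');
print "inner div text: ", $inner->textContent, "\n" if $inner;
```

To read from a file or a URL instead of a string, load_html also accepts a location argument in place of string.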

Upvotes: 6

bvr

Reputation: 9697

HTML::TreeBuilder builds nice trees and provides tons of handy methods to traverse them.
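
A short sketch of what that traversal can look like, assuming HTML::TreeBuilder is installed. The sample markup is made up; look_down, content_list, parent and depth are methods the tree nodes inherit from HTML::Element.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Made-up sample markup: both <p> elements are left unclosed.
my $html = <<'HTML';
<html><body>
<div id="outer">
  <p>first paragraph
  <div class="inner"><p>second paragraph</div>
</div>
</body></html>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# In list context, look_down returns every element matching the criteria.
my @divs = $tree->look_down(_tag => 'div');
for my $div (@divs) {
    my $label = $div->attr('id') // $div->attr('class') // '(none)';
    printf "div %s at depth %d\n", $label, $div->depth;
}

# In scalar context, look_down returns the first match (or undef).
my $inner = $tree->look_down(_tag => 'div', class => 'inner');
if ($inner) {
    for my $child ($inner->content_list) {
        # Children are either HTML::Element objects or plain text strings.
        if (ref $child) {
            print "child element: ", $child->tag, "\n";
        }
        else {
            print "child text: $child\n";
        }
    }
    print "parent of inner div: ", $inner->parent->tag, "\n";
}
```

Calling $tree->dump is a quick way to see exactly how the parser nested the elements it had to repair.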

Upvotes: 6
