Reputation: 116990
I have been using Cobra until now because it is easy to use, but unfortunately it fails on a few of my test cases. Can anyone suggest a tried-and-tested library?
I've tried Cobra's built-in parser and HTMLCleaner without any luck.
Upvotes: 0
Views: 1489
Reputation: 15993
I suggest Validator.nu's parser, based on the HTML5 parsing algorithm. (Mozilla is currently in the process of replacing its own HTML parser with this one.)
Upvotes: 1
Reputation: 38073
[Answering the title - the overall question and comments are not consistent]
JTidy (http://jtidy.sourceforge.net/) is a port of Dave Raggett's HTMLTidy. It's very useful, though I think development may have slowed or ceased.
Upvotes: 1
Reputation: 570595
TagSoup is really great when dealing with crappy HTML/XHTML.
Jericho (and NekoHTML) are also good at parsing invalid HTML.
TagSoup and Jericho are tried and tested; the NekoHTML recommendation is based on feedback from a trustworthy source.
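As a rough sketch of how TagSoup slots in: it exposes a SAX `XMLReader`, so handler code written against the standard SAX API can be reused with it. The `LinkExtractor` class and its method names below are made up for illustration. The JDK's strict XML reader stands in so the snippet runs without extra jars, and the commented line shows the assumed swap to TagSoup's `org.ccil.cowan.tagsoup.Parser`.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: collects link targets through the plain SAX API.
class LinkExtractor {

    /** Records the href of every <a> element the reader reports. */
    static class LinkHandler extends DefaultHandler {
        final List<String> links = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            // Match on qName, ignoring case, since input may use <A> or <a>.
            if ("a".equalsIgnoreCase(qName)) {
                String href = atts.getValue("href");
                if (href != null) {
                    links.add(href);
                }
            }
        }
    }

    /** Parses the markup with the supplied reader and returns the hrefs found. */
    static List<String> extractLinks(XMLReader reader, String markup) throws Exception {
        LinkHandler handler = new LinkHandler();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(markup)));
        return handler.links;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in: the JDK's XML reader, which only accepts well-formed input.
        // Assumption: with the TagSoup jar on the classpath you would use
        //   XMLReader reader = new org.ccil.cowan.tagsoup.Parser();
        // instead, so that malformed HTML is tolerated.
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();

        String page = "<html><body><a href=\"/one\">1</a>"
                + "<p><a href=\"/two\">2</a></p></body></html>";
        System.out.println(extractLinks(reader, page)); // prints [/one, /two]
    }
}
```

Only the line that creates the reader depends on the parser; the handler stays the same, which makes it easy to run the same test cases against different parsers.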
Upvotes: 4
Reputation: 101665
The Mozilla HTML Parser looks rather interesting. By design, it's supposed to be as good as the Gecko engine itself, which is likely to cover your needs.
Upvotes: 1
Reputation: 86774
Take a look at Saxon (no, I'm not involved in any way with the product, just a satisfied user).
Upvotes: 1