CODEWITHSUNDEEP

Reputation: 116990

Any good Java HTML parsers?

I was using Cobra until now because it was so easy to use, but unfortunately it failed on a few of my test cases. Can anyone suggest a tried-and-tested library?

I've tried Cobra's built-in parser and HTMLCleaner without any luck.

Upvotes: 0

Views: 1489

Answers (5)

Reputation: 15993

I suggest Validator.nu's parser, based on the HTML5 parsing algorithm. (Mozilla is currently in the process of replacing its own HTML parser with this one.)

Upvotes: 1

peter.murray.rust

Reputation: 38073

[Answering the title - the overall question and comments are not consistent]

JTidy (http://jtidy.sourceforge.net/) is a port of Dave Raggett's HTMLTidy. It's very useful, though I think development may have slowed or ceased.

Upvotes: 1

Reputation: 570595

TagSoup is really great when dealing with crappy HTML/XHTML.

Jericho (and NekoHTML) are also good for parsing invalid HTML.

TagSoup and Jericho: tried-and-tested. NekoHTML: feedback from a trusted source.
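
For what it's worth, TagSoup exposes the standard SAX `XMLReader` interface, so the handler code does not need to know which parser is underneath. Below is a minimal sketch of that idea, not a definitive implementation. The class and method names (`LinkExtractor`, `LinkHandler`, `extractLinks`, `newJdkReader`) are made up for illustration. To keep it runnable without extra jars, it uses the JDK's built-in XML parser on well-formed XHTML; the assumption is that with TagSoup on the classpath you would pass `new org.ccil.cowan.tagsoup.Parser()` as the reader instead, and that is what would cope with malformed markup.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: collects link targets from any SAX-compatible parser.
final class LinkExtractor {

    /** Records the href attribute of every <a> element the parser reports. */
    static final class LinkHandler extends DefaultHandler {
        final List<String> links = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            // qName is checked because the JDK reader below is not
            // namespace-aware by default, so localName may be empty.
            if ("a".equalsIgnoreCase(qName)) {
                String href = atts.getValue("href");
                if (href != null) {
                    links.add(href);
                }
            }
        }
    }

    /** Parses the markup with the given reader and returns the hrefs in document order. */
    static List<String> extractLinks(XMLReader reader, String markup) throws Exception {
        LinkHandler handler = new LinkHandler();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(markup)));
        return handler.links;
    }

    /** Stand-in reader from the JDK; it only accepts well-formed XML/XHTML. */
    static XMLReader newJdkReader() throws Exception {
        return SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body><a href=\"a.html\">A</a>"
                     + "<p><a href=\"b.html\">B</a></p></body></html>";
        // Assumption: with the TagSoup jar available, replace newJdkReader()
        // with new org.ccil.cowan.tagsoup.Parser() to accept broken HTML.
        System.out.println(extractLinks(newJdkReader(), xhtml)); // prints [a.html, b.html]
    }
}
```

The point of passing the `XMLReader` in as a parameter is that switching parsers later only touches the one line that creates the reader.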

Upvotes: 4

Pavel Minaev

Reputation: 101665

Mozilla HTML Parser looks rather interesting. By definition, it's supposed to be as good as the Gecko engine itself, which is likely to cover your needs.

Upvotes: 1

Reputation: 86774

Take a look at Saxon (no, I'm not involved in any way with the product, just a satisfied user).

Upvotes: 1
