stevevls

Reputation: 10853

Java library for cleaning up HTML just like a browser would

So here's the challenge... I need to create clean HTML from random web pages out there in the wild. My goal is to read in a page and pass it off to a library which will in turn give me back perfectly well-formed HTML.

Doesn't sound so tough, right? After all, every browser on the market deals with malformed HTML on nearly every page load and turns it into something renderable. Each has its own slightly particular algorithm for cleaning up the contents (ahem...for HTML < 5 that is), but they tend to do a very good job of capturing what I like to refer to as the author's intention. So then, why can't I find a good Java library for this very task?

One thing to mention is that I'm not at all interested in parsing the HTML as XML. I've found that libraries such as NekoHTML, TagSoup, HtmlCleaner, and JTidy (to name a few) are more focused on solving the problem of converting HTML to valid XML, and in the process, they lose sight of how the poorly formatted document should be restructured. With nasty HTML they frequently fail to capture the author's intention and spit out documents that render quite differently from the original source. And for this project, it's of the utmost importance that the two documents render similarly.

I am quite fond of Jericho HTML, but it doesn't seem to be the ideal candidate for this job...at least not without a lot of effort on my part. Also, native dependencies are a no-go, so the Mozilla parser is out.

Can anyone help me in my search for the perfect HTML parser? Thanks in advance!

Upvotes: 14

Views: 8877

Answers (3)

Chris Nava

Reputation: 6802

I have used HTML Tidy in the past.

Upvotes: 1

user240515

Reputation: 3207

TagSoup?

Upvotes: 0

Jigar Joshi

Reputation: 240966

JSoup, I would say.


Upvotes: 7
