Reputation: 5223

What is the fastest way to remove HTML tags from a document in Java?

I have a bunch of web documents and want to remove the HTML tags from them. I saw some posts on StackOverflow on how to do this in Java, using everything from regex to HtmlCleaner and Jsoup.

I am interested in finding the fastest way to do it. I have millions of documents, so performance is crucial in my case. I can even trade a bit of quality for the performance.

Thanks for any answers in advance.

Upvotes: 2

Answers (3)

user3111525

Reputation: 5223

It seems that Java regex is the fastest solution. However, it degrades the quality of the extracted text.
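For reference, a minimal sketch of what the regex approach might look like. The class name `RegexTagStripper` and the pattern `<[^>]*>` are my own assumptions, not something taken from a library. The pattern is compiled once and reused, which matters when looping over millions of documents.

```java
import java.util.regex.Pattern;

final class RegexTagStripper {
    // Compiled once and shared; Pattern instances are immutable and thread-safe.
    // Assumption: anything between '<' and the next '>' is treated as a tag.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    // Replaces every tag with a single space so adjacent words do not get glued together.
    static String strip(String html) {
        return TAG.matcher(html).replaceAll(" ");
    }

    public static void main(String[] args) {
        System.out.println(strip("<p>Hello <b>world</b></p>"));
    }
}
```

This is where the quality loss comes from: the contents of `<script>` and `<style>` elements stay in the output, entities such as `&amp;` are not decoded, and a `>` inside an attribute value ends the match early.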

Upvotes: 0

Andy Petrella

Reputation: 4345

My advice is to use stream/SAX processing as much as possible: 1) it uses less memory, 2) it is fast, 3) it is easier to parallelize (a consequence of the low memory consumption).

In my view, your use case needs those properties, since you have millions of documents. See the Wikipedia article on SAX.

So if your HTML is strict or XHTML, use XSLT; here is a tutorial on how to transform XML (XHTML) using SAX: XSLT+SAX+Java.
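To give an idea of the SAX side for well-formed XHTML, here is a rough sketch that skips XSLT and simply collects the text events with a plain `DefaultHandler` from the JDK. The class name `SaxTextExtractor` is made up for illustration, and the sketch assumes the input has no DOCTYPE that would make the parser fetch an external DTD.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class SaxTextExtractor {
    // Returns the concatenated character data of a well-formed XHTML document.
    static String extractText(String xhtml) {
        final StringBuilder text = new StringBuilder();
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new InputSource(new StringReader(xhtml)), new DefaultHandler() {
                @Override
                public void characters(char[] ch, int start, int length) {
                    // Called for each chunk of text between tags; the tags themselves are never copied.
                    text.append(ch, start, length);
                }
            });
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // Input that is not well-formed XML ends up here.
            throw new IllegalArgumentException("Input could not be parsed as XML", e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        System.out.println(extractText("<html><body><p>Hello <b>world</b></p></body></html>"));
    }
}
```

For real files you would pass an `InputSource` over the file stream instead of a string, so the whole document never has to sit in memory.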

Finally, if your HTML is NOT valid XML, look at Java: Replace Strings in Streams, Arrays, Files etc., which makes use of streams (and PushbackReader).
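As an illustration of the streaming idea for HTML that is not valid XML, here is a small sketch of a character-by-character filter. It is not the code from the linked article and it does not use `PushbackReader`; it is a simplified state machine, and the names `StreamingTagStripper`, `strip` and `stripString` are hypothetical.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

final class StreamingTagStripper {
    // Copies characters from in to out, skipping everything between '<' and '>'.
    // Memory use is constant regardless of document size.
    static void strip(Reader in, Writer out) throws IOException {
        boolean inTag = false;
        int c;
        while ((c = in.read()) != -1) {
            if (c == '<') {
                inTag = true;              // start skipping
            } else if (c == '>' && inTag) {
                inTag = false;             // tag finished
                out.write(' ');            // keep words on both sides separated
            } else if (!inTag) {
                out.write(c);              // ordinary text
            }
        }
    }

    // Convenience wrapper for in-memory strings.
    static String stripString(String html) {
        StringWriter out = new StringWriter();
        try {
            strip(new StringReader(html), out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripString("<p>Hello <b>world</b></p>"));
    }
}
```

When reading from disk, wrap the file reader in a `BufferedReader`, otherwise the per-character `read()` calls will dominate the run time. Like the regex approach, this keeps script and style contents and does not decode entities.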

HTH

Upvotes: 1

dinesh028

Reputation: 2187

1) If the HTML is proper XML, you can build a document object from it and remove the nodes.

2) If it is not proper XML, read the entire HTML as a string and use a replace function to remove the HTML tag substrings.

If the HTML is not proper XML, regex is the fastest way to do the replacement in a string.
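For option 1, a minimal sketch using the JDK's DOM parser could look like the following. Rather than removing nodes one by one, it reads the text content of the root element, which gives the same result for tag stripping. The class name `DomTextExtractor` is an assumption for illustration. Note that DOM loads the whole document into memory, so it is unlikely to be the fastest option for millions of files.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class DomTextExtractor {
    // Parses well-formed XHTML into a DOM tree and returns all of its text.
    static String extractText(String xhtml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
            // getTextContent() concatenates the text of all descendant nodes.
            return doc.getDocumentElement().getTextContent();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // Input that is not proper XML ends up here; fall back to option 2 in that case.
            throw new IllegalArgumentException("Input could not be parsed as XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(extractText("<html><body><p>Hello <b>world</b></p></body></html>"));
    }
}
```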

Upvotes: 0
