Reputation: 5223
I have a bunch of web documents and want to remove the HTML tags from them. I have seen several posts on StackOverflow on how to do this in Java, ranging from regex to HtmlCleaner and Jsoup.
I am interested in finding the fastest way to do it. I have millions of documents, so performance is crucial in my case. I can even trade a bit of quality for performance.
Thanks for any answers in advance.
Upvotes: 2
Views: 1458
Reputation: 5223
It seems that a Java regex is the fastest solution. However, it degrades the quality of the extracted text.
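For reference, a minimal sketch of what the regex approach might look like. The class name `RegexTagStripper` and the exact patterns are my own assumptions, not something taken from the posts mentioned above. Compiling the patterns once matters when processing millions of documents, since `String.replaceAll` recompiles the pattern on every call.

```java
import java.util.regex.Pattern;

// Hypothetical helper: strips anything that looks like <...> from a string.
public class RegexTagStripper {
    // Compiled once and reused; Pattern instances are thread-safe.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    public static String strip(String html) {
        // Replace each tag with a space so adjacent words do not fuse together.
        String withoutTags = TAG.matcher(html).replaceAll(" ");
        // Collapse the runs of whitespace left behind by removed tags.
        return WHITESPACE.matcher(withoutTags).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        System.out.println(strip("<p>Hello <b>world</b></p>"));
    }
}
```

The quality loss comes from what this pattern ignores: the contents of `script` and `style` elements stay in the output, entities such as `&amp;` are not decoded, and a `>` inside an attribute value or comment ends the match early.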
Upvotes: 0
Reputation: 4345
In my opinion, you should use stream/SAX processing as much as possible: 1) it uses less memory, 2) it is fast, and 3) it can be parallelized more easily (a consequence of the low memory consumption).
From my point of view, those factors are what your use case needs, since you have millions of documents. See the Wikipedia article on SAX.
So if your HTML is strict or is XHTML, use XSLT; here is a tutorial on how to transform XML (XHTML) using SAX: XSLT+SAX+Java.
And finally, if your HTML is NOT valid XML, look at "Java: Replace Strings in Streams, Arrays, Files etc.", which makes use of streams (and PushbackReader).
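To illustrate the streaming idea for HTML that is not valid XML, here is a rough sketch of a character-level filter. It is not the code from the linked article; the class name `StreamingTagStripper` is hypothetical, and the sketch assumes that every `<` starts a tag and the next `>` ends it. It never holds the whole document in memory, which is the point of the stream approach.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

// Hypothetical helper: copies characters from a Reader to a Writer,
// skipping everything between '<' and the next '>'.
public class StreamingTagStripper {

    public static void strip(Reader in, Writer out) throws IOException {
        boolean inTag = false;
        int c;
        while ((c = in.read()) != -1) {
            if (c == '<') {
                inTag = true;            // start skipping
            } else if (c == '>' && inTag) {
                inTag = false;           // tag ended
                out.write(' ');          // keep neighbouring words apart
            } else if (!inTag) {
                out.write(c);            // plain text, copy through
            }
        }
    }

    // Convenience wrapper for in-memory strings.
    public static String strip(String html) {
        StringWriter out = new StringWriter();
        try {
            strip(new StringReader(html), out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}
```

For files, wrapping the input in a `BufferedReader` and the output in a `BufferedWriter` avoids one system call per character. Since each document is processed independently with constant memory, running one such filter per worker thread should be straightforward.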
HTH
Upvotes: 1
Reputation: 2187
1) If the HTML is proper XML, then you can create its document object and remove the nodes.
2) If it is not proper XML, then read the entire HTML as a string and use a replace function to remove the HTML tag substrings.
If the HTML is not proper XML, then regex is the fastest way to replace in a string.
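A minimal sketch of option 1, assuming the input is well-formed XML with no DOCTYPE and no HTML-only entities such as `&nbsp;` (either would need extra parser configuration). The class name `DomTextExtractor` is hypothetical. Instead of removing nodes one by one, it asks the root element for its text content, which concatenates all text nodes below it.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper: parses well-formed XHTML into a DOM tree and
// returns the concatenated text of all text nodes.
public class DomTextExtractor {

    public static String text(String xhtml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
            return doc.getDocumentElement().getTextContent();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // Input that is not well-formed XML ends up here.
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }
}
```

Building a full DOM tree per document costs more memory and time than the regex or stream approaches, so with millions of documents this is probably only worth it when the extra accuracy matters.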
Upvotes: 0