Alternative to Jsoup, or how to clean whitespace

Question

Does anybody know an alternative to Jsoup?

Or how to clean sequences like

?

The HTML Clean plug-in for jQuery works well for me, but I want to clean the HTML on the server side, not on the client side.

Or, what replaceAll expression would do this?

String cleanS = dirtyS.replaceAll(" ", ""); // This doesn't work

I have discovered that the dirty HTML comes with mixed sequences of non-breaking spaces (char #160) and ordinary spaces (char #32).

So what I need is an expression that removes any mixture of them.


ollo · Accepted Answer

You can change the OutputSettings for this:

Example:

final String html = ...;


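// OutputSettings here is org.jsoup.nodes.Document.OutputSettings
OutputSettings settings = new OutputSettings();
settings.escapeMode(Entities.EscapeMode.xhtml);

// Note: newer Jsoup releases rename Whitelist to Safelist
String cleanHtml = Jsoup.clean(html, "", Whitelist.relaxed(), settings);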

This is possible with a Document parsed by Jsoup too:

Document doc = Jsoup.parse(...);
doc.outputSettings().escapeMode(Entities.EscapeMode.xhtml);

// ...

Edit:

Removing tags:

doc.select("p:matchesOwn((?is) )").remove();

Please note: after (?is) there's not a blank, but char #160 (= nbsp). This will remove all p tags whose own text is only a &nbsp;. If you want to do the same for all other tags, you can replace the p: with *:.
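
For the replaceAll route asked about in the question, a minimal sketch in plain Java (no Jsoup) might look like the following. It assumes the string already holds the actual characters #160 and #32 rather than entity text such as &nbsp;, and the class and method names (BlankRunCleaner, removeBlankRuns, collapseBlankRuns) are made up for illustration.

```java
import java.util.regex.Pattern;

final class BlankRunCleaner {
    // One or more characters, each either a non-breaking space (#160)
    // or an ordinary space (#32), in any mixture.
    private static final Pattern BLANK_RUN = Pattern.compile("[\\u00A0\\u0020]+");

    // Removes every run entirely, as the replaceAll(..., "") in the question intends.
    static String removeBlankRuns(String dirty) {
        return BLANK_RUN.matcher(dirty).replaceAll("");
    }

    // Collapses every run into one ordinary space and trims both ends,
    // which keeps words separated.
    static String collapseBlankRuns(String dirty) {
        return BLANK_RUN.matcher(dirty).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String dirty = "Hello\u00A0 \u00A0world";
        System.out.println(collapseBlankRuns(dirty)); // Hello world
        System.out.println(removeBlankRuns(dirty));   // Helloworld
    }
}
```

Removing the runs outright also deletes the ordinary spaces between words, so collapsing them to a single space is probably the safer choice for running text.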
