Reputation: 1571
Does anybody know an alternative to JSoup?
Or how to clean sequences like <p>&nbsp;</p>?
The HTML Clean plug-in for jQuery works well for me, but I want to do the HTML cleaning on the server side, not on the client side.
Or, what replaceAll expression would do this?
String cleanS = dirtyS.replaceAll("<p> </p>", ""); // This doesn't work
I have discovered that the dirty HTML contains mixed sequences of non-breaking spaces (char #160) and ordinary spaces (char #32).
So what I need is an expression that removes any mixture of them.
Upvotes: 4
Views: 11589
Reputation: 25340
You can change the OutputSettings for this. Example:
final String html = ...;
OutputSettings settings = new OutputSettings();
settings.escapeMode(Entities.EscapeMode.xhtml);
String cleanHtml = Jsoup.clean(html, "", Whitelist.relaxed(), settings);
This is also possible with a Document parsed by Jsoup:
Document doc = Jsoup.parse(...);
doc.outputSettings().escapeMode(Entities.EscapeMode.xhtml);
// ...
Edit:
Removing tags:
doc.select("p:matchesOwn((?is) )").remove();
Please note: the character after (?is) is not an ordinary blank, but char #160 (= nbsp).
This will remove all p tags whose own text is only a &nbsp;. If you want to do the same for all other tags, replace the p with *.
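For the replaceAll route asked about in the question, a minimal sketch without Jsoup could look like the following. It is only an illustration: the class name EmptyParagraphStripper and the method strip are made up here, and the pattern assumes the empty paragraphs contain nothing but ordinary whitespace, char #160, or the literal entities &nbsp; / &#160;. Regular expressions are fragile on real-world HTML, so a parser-based approach is usually the safer choice.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Jsoup.
final class EmptyParagraphStripper {

    // <p> with optional attributes, then any mix of whitespace (#32, tabs,
    // newlines), U+00A0 (#160) and the entities &nbsp; / &#160;, then </p>.
    private static final Pattern EMPTY_P = Pattern.compile(
            "<p(?:\\s[^>]*)?>(?:\\s|\\u00A0|&nbsp;|&#160;)*</p>",
            Pattern.CASE_INSENSITIVE);

    // Returns the input with every whitespace-only paragraph removed.
    static String strip(String dirtyHtml) {
        return EMPTY_P.matcher(dirtyHtml).replaceAll("");
    }
}
```

Calling EmptyParagraphStripper.strip(dirtyS) in place of the original replaceAll leaves paragraphs that contain real text untouched.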
Upvotes: 8
Reputation:
If you have the Document object, you can loop over the paragraph elements and remove all those that have no text (or only whitespace) in them. Before checking whether the text is empty, you can replace the occurrences of &nbsp; with nothing, so that trim() can deal with the rest. Assuming you are working with UTF-8 documents, the following might work for you:
public static final String NBSP_IN_UTF8 = "\u00a0";
Assuming you know how to get the Document object, the loop to clean is simple: select the paragraph elements and remove empty ones:
org.jsoup.nodes.Document doc = ... // obtain your document object
for (org.jsoup.nodes.Element element : doc.select("p")) {
    // Strip nbsp first: trim() only removes characters up to U+0020
    if (!element.hasText() || element.text().replaceAll(NBSP_IN_UTF8, "").trim().equals("")) {
        element.remove();
    }
}
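The reason for replacing the nbsp before trimming is that String.trim() only strips characters up to U+0020, and Character.isWhitespace() deliberately excludes non-breaking spaces. A small sketch of a blank check that treats char #160 like an ordinary space (the class NbspBlankCheck and its method isBlank are hypothetical names used only for this illustration):

```java
// Hypothetical helper illustrating the nbsp-aware emptiness check.
final class NbspBlankCheck {

    static final char NBSP = '\u00a0';

    // True if the text is empty or consists only of whitespace and nbsp characters.
    static boolean isBlank(String text) {
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c != NBSP && !Character.isWhitespace(c)) {
                return false;
            }
        }
        return true;
    }
}
```

With such a helper, the condition in the loop could be written as NbspBlankCheck.isBlank(element.text()).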
Upvotes: 1