JLLMNCHR

Reputation: 1571

Alternative to JSoup, or how to clean whitespace

Does anybody know an alternative to JSoup?

Or how to clean sequences like <p>&nbsp;</p>?

The HTML Clean plug-in for jQuery works well for me, but I want to do the HTML cleaning on the server side, not on the client side.

Or, what is the replaceAll expression to do this?

String cleanS = dirtyS.replaceAll("<p>&nbsp;</p>", ""); // This doesn't work

I have discovered that the dirty HTML contains mixed sequences of blank characters: #160 (non-breaking space) and others such as #32 (ordinary space).

So what I need is an expression that removes any mixture of them.


Upvotes: 4

Views: 11589

Answers (2)

ollo

Reputation: 25340

You can change the OutputSettings for this:

Example:

final String html = ...;


OutputSettings settings = new OutputSettings();
settings.escapeMode(Entities.EscapeMode.xhtml);

String cleanHtml = Jsoup.clean(html, "", Whitelist.relaxed(), settings);

This is possible with a Document parsed by Jsoup too:

Document doc = Jsoup.parse(...);
doc.outputSettings().escapeMode(Entities.EscapeMode.xhtml);

// ...

Edit:

Removing tags:

doc.select("p:matchesOwn((?is)\u00a0)").remove();

Please note: the character after (?is) is not an ordinary blank but char #160 (= nbsp), written here as the Java escape \u00a0 so that it stays visible. This will remove all p tags whose own text is only a &nbsp;. If you want to do the same for all other tags, replace the p: with *:.

Upvotes: 8

user890904


If you have the Document object, you can loop over the paragraph elements and remove all those that have no text (or only whitespace) in them. Before checking whether the text is empty, you can replace any occurrences of NBSP (the code below replaces them with the empty string). Assuming you're working with UTF-8 documents, the following might work for you:

public static final String NBSP_IN_UTF8 = "\u00a0"; 

Assuming you know how to get the Document object, the loop to clean is simple: select the paragraph elements and remove empty ones:

org.jsoup.nodes.Document doc = ...;   // obtain your Document object
for (org.jsoup.nodes.Element element : doc.select("p")) {
    // Strip NBSP characters, then trim ordinary whitespace, before testing for emptiness
    if (!element.hasText() || element.text().replaceAll(NBSP_IN_UTF8, "").trim().isEmpty()) {
        element.remove();
    }
}
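
If you only have the raw string and want the plain replaceAll route the question asks about, here is a minimal sketch using nothing but java.util.regex. The original pattern most likely fails because the markup contains the actual character #160 (or a mix with #32) rather than the literal text &nbsp;. The class and method names (ParagraphCleaner, removeBlankParagraphs) are made up for illustration, and the sketch assumes simple, non-nested p tags; for anything more complex a real parser such as Jsoup is the safer choice.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of jsoup or any library.
final class ParagraphCleaner {

    // Matches a <p> (optionally with attributes) whose content is any mixture of
    // the entities &nbsp; / &#160;, the character U+00A0 (#160), and ordinary
    // whitespace such as #32. Java's \s does not cover U+00A0 by default, so it
    // is listed explicitly.
    private static final Pattern BLANK_PARAGRAPH = Pattern.compile(
            "<p(?:\\s[^>]*)?>(?:&nbsp;|&#160;|[\\s\\u00A0])*</p>",
            Pattern.CASE_INSENSITIVE);

    // Returns the input with every blank paragraph removed.
    static String removeBlankParagraphs(String html) {
        return BLANK_PARAGRAPH.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String dirty = "<p>Keep me</p><p>&nbsp;\u00A0 </p><p>\u00A0</p>";
        System.out.println(removeBlankParagraphs(dirty)); // prints <p>Keep me</p>
    }
}
```

Paragraphs that contain real text next to a non-breaking space (for example <p>a&nbsp;b</p>) are left untouched, because the pattern only matches when nothing but blanks sits between the tags.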

Upvotes: 1
