Ziv Gabovitch

Reputation: 775

Jsoup clean method leaves &nbsp; elements

I tried using this code to strip all HTML elements from my text:

Jsoup.clean(preparedText, Whitelist.none())

Unfortunately, it didn't remove the &nbsp; elements. I thought it would replace them with whitespace, the same way it replaces &middot; with a middle dot ("·").

Should I use another method in order to achieve this functionality?

Upvotes: 10

Views: 4838

Answers (1)

luksch

Reputation: 11712

From the Jsoup docs:

Whitelists define what HTML (elements and attributes) to allow through the cleaner. Everything else is removed.

So the whitelist is concerned only with tags and attributes. &nbsp; is neither a tag nor an attribute. It is simply the HTML entity encoding of a special character (the non-breaking space). If you want to translate from the encoding to normal text, you can use, for example, the excellent Apache Commons Lang library or Jsoup's Parser.unescapeEntities method:

System.out.println(Parser.unescapeEntities(doc.toString(), false));
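Note that unescaping &nbsp; yields the non-breaking space character (U+00A0), not a regular space. If a plain space is what you want, a small post-processing step could handle both forms. The following is only a sketch using the standard library; the class name NbspNormalizer and the method normalize are made up for illustration and are not part of Jsoup:

```java
// Hypothetical helper, not part of Jsoup: maps both the &nbsp; entity and
// the decoded U+00A0 character to an ordinary space.
final class NbspNormalizer {
    private NbspNormalizer() {}

    static String normalize(String cleaned) {
        return cleaned
                .replace("&nbsp;", " ")   // entity form, as left in the cleaned text
                .replace('\u00A0', ' ');  // decoded form, as produced by unescaping
    }

    public static void main(String[] args) {
        // Assumed input: text that has already been passed through Jsoup.clean.
        System.out.println(normalize("Hello&nbsp;World"));
    }
}
```

You would call normalize on the string returned by Jsoup.clean (or on the unescaped string), depending on which form your text ends up in.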

Addendum:

The translation from &middot; to "·" already happens when you parse the HTML. It does not seem to be related to the clean method.

Upvotes: 5
