Reputation: 3975
I have something like this in my code:
Whitelist whitelist = new Whitelist();
whitelist.addTags("p", "i", "b", "em", "strong", "u");
String content = Jsoup.clean(data.html(), whitelist);
But the Jsoup library removes " and '. How do I prevent that?
e.g. input = <p>It's a sunny day.</p>
result = It? s a sunny day.
Upvotes: 0
Views: 2022
Reputation:
You are using data.html(). Here is what the API documentation of the Element class says about it: Element API
Retrieves the element's inner HTML. E.g. on a <div> with one empty <p>, would return <p></p>. (Whereas Node.outerHtml() would return <div><p></p></div>.)
So you should use the outerHtml() method instead:
String content = Jsoup.clean(data.outerHtml(), whitelist);
Here is another link with useful examples. The example uses both methods, so you can see the difference: Jsoup Attribute text and HTML example
As for the other issue (the quote being turned into a question mark), I think it's a matter of encoding and character set, as it does not happen on my PC. Check the encoding of the source HTML file and parse it in Jsoup with the matching character set from the start.
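To illustrate the encoding point, here is a minimal sketch using only the Java standard library (no Jsoup). It assumes the source text actually contained a typographic apostrophe (U+2019) rather than a plain ', which is a guess about your data. The class name CharsetRoundTrip and the method roundTrip are made up for this example. It shows that pushing such a character through a charset that cannot represent it produces a literal question mark, while a matching charset preserves it.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetRoundTrip {

    // Encode the text to bytes and decode it again with the same charset.
    // String.getBytes(Charset) replaces unmappable characters with the
    // charset's replacement byte, which is '?' for ISO-8859-1.
    static String roundTrip(String text, Charset charset) {
        byte[] raw = text.getBytes(charset);
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // Assumption: the apostrophe in the source is the typographic U+2019.
        String original = "<p>It\u2019s a sunny day.</p>";

        // ISO-8859-1 cannot represent U+2019, so it is lost.
        String lossy = roundTrip(original, StandardCharsets.ISO_8859_1);
        System.out.println(lossy.contains("?"));

        // UTF-8 can represent it, so the text survives unchanged.
        String intact = roundTrip(original, StandardCharsets.UTF_8);
        System.out.println(intact.equals(original));
    }
}
```

If this matches what you see, the fix is on the input side: Jsoup.parse has overloads that accept a charset name (for example, Jsoup.parse(File in, String charsetName)), so you can pass the charset the file was really saved in.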
Upvotes: 4