Reputation: 3059
I want to preserve HTML entities while using Jsoup. Here is a UTF-8 test string from a website:
String html = "<html><body>hello &#151; world</body></html>";
String parsed = Jsoup.parse(html).toString();
When I print the parsed output as UTF-8, the sequence &#151; appears to be transformed into a character with a code point value of 151.
Is there a way to have Jsoup preserve the original entity when outputting as UTF-8? If I output with ASCII encoding instead:
Document.OutputSettings settings = new Document.OutputSettings();
settings.charset(Charset.forName("ascii"));
Jsoup.parse(html).outputSettings(settings).toString();
I'll get:
hello &#x97; world
which is what I'm looking for.
Upvotes: 6
Views: 1246
Reputation: 43033
You have hit a missing feature of Jsoup (as of this writing, Jsoup 1.8.3).
I can see three options:
File a feature request at https://github.com/jhy/jsoup, though I'm not sure it will be added soon.
Use the workaround provided in this SO answer: https://stackoverflow.com/a/34493022/363573
Write a custom NodeVisitor that turns characters with such a code point value back into their equivalent HTML escape sequence.
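For the third option, the part that does not depend on Jsoup is the conversion from a character back to a numeric escape. Here is a minimal sketch of that step in plain Java. The class name EntityEscaper and the method escapeNonAscii are hypothetical, and the choice to escape everything above code point 127 as a decimal reference is an assumption. Wiring it into a NodeVisitor is not shown; note that text stored in a regular text node will likely have its & escaped again on output, so that part needs care.

```java
// Hypothetical helper, not part of Jsoup: converts non-ASCII characters
// into decimal numeric character references such as &#151;.
final class EntityEscaper {
    private EntityEscaper() {}

    static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            // Read a full code point so surrogate pairs become one reference.
            int cp = text.codePointAt(i);
            if (cp > 127) {
                // Assumption: anything outside ASCII is escaped.
                out.append("&#").append(cp).append(';');
            } else {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

Given the text "hello", a space, the character with code point 151, a space, and "world", escapeNonAscii returns "hello &#151; world". ASCII characters, including any existing &, are passed through unchanged.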
Upvotes: 2