yotam.shacham

Reputation: 307

JSoup character encoding issue #2

I'm constructing a JSoup document like this:

String user_agent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/A.B     (KHTML, like Gecko) Chrome/X.Y.Z.W Safari/A.B.";
String url = "http://www.ncbi.nlm.nih.gov/pmc/articles/PMC24391/?tool=pubmed";
Document doc = Jsoup.connect(url).userAgent(user_agent).get();

Then I save it to a file using doc.toString(), and in the saved file some characters are replaced by ?. For example, 5 μm becomes 5 ?m.

If I change the output settings to use the ISO-8859-1 charset, it seems OK.

Can anyone explain why this is? From my understanding, the original HTML page is UTF-8, which is the default Jsoup encoding.

Upvotes: 2

Views: 2171

Answers (1)

BalusC

Reputation: 1108672

Works fine for me. Your problem is caused elsewhere.

The most probable cause is that you didn't save the file using UTF-8. You should use an OutputStreamWriter to write the characters to the file in an explicitly specified character encoding.

writer = new OutputStreamWriter(new FileOutputStream(file), "UTF-8");
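For completeness, here is a minimal, self-contained sketch of that idea. The class name Utf8FileDemo, the helper methods saveUtf8 and readUtf8, and the temp file are made up for illustration, and the html string is only a stand-in for whatever doc.toString() returns in your code.

```java
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class Utf8FileDemo {

    // Writes text to a file, encoding the characters as UTF-8 explicitly
    // instead of relying on the platform default charset.
    static void saveUtf8(Path file, String text) throws IOException {
        try (Writer writer = new OutputStreamWriter(
                Files.newOutputStream(file), StandardCharsets.UTF_8)) {
            writer.write(text);
        }
    }

    // Reads the file back, decoding its bytes as UTF-8.
    static String readUtf8(Path file) throws IOException {
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for doc.toString(); \u03bc is the Greek small letter mu.
        String html = "<p>5 \u03bcm</p>";
        Path file = Files.createTempFile("jsoup-demo", ".html");
        saveUtf8(file, html);
        // Compare in memory rather than printing the text, so the console
        // encoding cannot affect the check.
        System.out.println(html.equals(readUtf8(file)));
        Files.delete(file);
    }
}
```

The try-with-resources block closes (and therefore flushes) the writer, and StandardCharsets.UTF_8 avoids the checked UnsupportedEncodingException that the "UTF-8" string overload can throw.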

Also, you need to make sure that the file viewer, or whatever process you use after saving the file, also uses UTF-8. The encoding has to be consistent throughout the entire pipeline. See also Unicode - How to get the characters right?
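To illustrate where the ? typically comes from, here is a small sketch, assuming the character in question is U+03BC (Greek small letter mu). The class name CharsetMismatchDemo and the roundTrip helper are hypothetical. When a String is encoded with a charset that cannot represent a character, Java's String.getBytes substitutes the charset's replacement byte, which for US-ASCII is ?.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetMismatchDemo {

    // Encodes the text with the given charset and decodes the resulting
    // bytes with the same charset, mimicking a save-then-view round trip.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        String text = "5 \u03bcm"; // assumed to be U+03BC, Greek small letter mu

        // UTF-8 can represent every Unicode character, so the text survives.
        System.out.println(roundTrip(text, StandardCharsets.UTF_8).equals(text)); // true

        // US-ASCII cannot represent U+03BC, so it is replaced on encoding.
        System.out.println(roundTrip(text, StandardCharsets.US_ASCII)); // 5 ?m
    }
}
```

The same substitution happens inside a FileWriter or OutputStreamWriter that falls back to a platform default charset lacking the character, which is why specifying UTF-8 at every step matters.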

Upvotes: 3
