Reputation: 307
I'm constructing a JSoup document like this:
String user_agent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/A.B (KHTML, like Gecko) Chrome/X.Y.Z.W Safari/A.B.";
String url = "http://www.ncbi.nlm.nih.gov/pmc/articles/PMC24391/?tool=pubmed";
Document doc = Jsoup.connect(url).userAgent(user_agent).get();
Then I save it to a file using doc.toString(), and in the saved file some characters are replaced by ?. For example, 5 μm becomes 5 ?m.
If I change the output settings to use the ISO-8859-1 charset, it seems OK.
Can anyone explain why this is? From my understanding, the original HTML page is UTF-8, which is the default Jsoup encoding.
Upvotes: 2
Views: 2171
Reputation: 1108672
Works fine for me. Your problem is caused elsewhere.
The most probable cause is that you didn't save the file as UTF-8. You should use an OutputStreamWriter to write characters to the file in a specified character encoding.
writer = new OutputStreamWriter(new FileOutputStream(file), "UTF-8");
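For a fuller picture, here is a minimal sketch of saving a string as UTF-8 and reading it back. The class name Utf8Save and the helper names are made up for illustration, and the html string merely stands in for whatever doc.toString() returns in your code.

```java
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class, not part of Jsoup.
public class Utf8Save {

    // Writes the text to the file, encoding characters as UTF-8
    // regardless of the platform default charset.
    static void saveAsUtf8(Path file, String text) throws IOException {
        try (Writer writer = new OutputStreamWriter(
                Files.newOutputStream(file), StandardCharsets.UTF_8)) {
            writer.write(text);
        }
    }

    // Reads the file back, decoding its bytes as UTF-8.
    static String readAsUtf8(Path file) throws IOException {
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("jsoup-page", ".html");
        // Assumption: stand-in for doc.toString(); \u03BC is the Greek letter mu.
        String html = "<p>5 \u03BCm</p>";
        saveAsUtf8(file, html);
        // Prints whether the round trip preserved every character.
        System.out.println(html.equals(readAsUtf8(file)));
        Files.delete(file);
    }
}
```

The try-with-resources block closes the writer, which also flushes it, so nothing is left in the buffer.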
Also, you need to make sure that the file viewer or whatever process you use after saving the file is also using UTF-8 throughout the entire pipeline. See also Unicode - How to get the characters right?
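To illustrate how a mismatch at either end of the pipeline shows up, here is a small sketch using only the JDK. The class and method names are hypothetical. It assumes the problematic character is the Greek letter mu (U+03BC), which ISO-8859-1 cannot represent; whether that is exactly what happened in your case depends on how the file was written and viewed.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class showing two kinds of charset mismatch.
public class CharsetMismatch {

    // Encodes the text with the given charset and decodes it with the same one.
    // Characters the charset cannot represent are replaced during encoding.
    static String encodeThenDecode(String text, Charset cs) {
        return new String(text.getBytes(cs), cs);
    }

    // Encodes the text as UTF-8 but decodes the bytes with a different charset,
    // as a viewer would if it guessed the wrong encoding.
    static String writeUtf8ReadAs(String text, Charset wrong) {
        return new String(text.getBytes(StandardCharsets.UTF_8), wrong);
    }

    public static void main(String[] args) {
        String text = "5 \u03BCm"; // "5 μm" with a Greek mu

        // Writing with a charset that lacks the character loses it.
        System.out.println(encodeThenDecode(text, StandardCharsets.ISO_8859_1));

        // Writing with UTF-8 keeps it intact.
        System.out.println(encodeThenDecode(text, StandardCharsets.UTF_8).equals(text));

        // Correct UTF-8 bytes viewed as ISO-8859-1 turn into two wrong characters.
        System.out.println(writeUtf8ReadAs(text, StandardCharsets.ISO_8859_1).length());
    }
}
```

The first case is the one that produces a literal ? in the saved file: the character is already gone by the time the bytes hit the disk, so no viewer setting can bring it back.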
Upvotes: 3