Reputation: 127
I'm reading a file which contains Unicode escape sequences among the text; here's an example:
\u201c@hannah_hartzler: In line for the gate keeper! @nerk97 @ShannonWalkup\u201d\ud83d\ude0d\ud83d\ude0d\ud83d\ude0d\ud83d\ude0d\u2764\u2764\u2764
When I read it with a BufferedReader
and write it back to another file with a FileWriter
the text becomes this:
“@hannah_hartzler: In line for the gate keeper! @nerk97 @ShannonWalkupâ€ðŸ˜ðŸ˜ðŸ˜ðŸ˜â¤â¤â¤
due to the UTF-8 encoding, but what I want to have is:
“@hannah_hartzler: In line for the gate keeper! @nerk97 @ShannonWalkup”😍😍😍😍❤❤❤
My question is: how do I read and write the lines of text correctly, so that the right characters get printed?
I don't modify the lines of text; it's just a problem of conversion between Unicode and UTF-8. Here's my code:
FileReader fileReader = new FileReader("tweets.json");
BufferedReader bufferedReader = new BufferedReader(fileReader);
File tmp = new File("out.txt");
FileWriter fileWriter = new FileWriter(tmp);
BufferedWriter bw = new BufferedWriter(fileWriter);
...
String line = bufferedReader.readLine();
bw.write(line);
Upvotes: 0
Views: 1027
Reputation: 298579
When you open a file via new FileReader("tweets.json")
, its contents get interpreted using the system’s default encoding. When you open the target file via new BufferedWriter(fileWriter)
, the characters get encoded using the system’s default encoding again. This might look like the file gets copied as-is, but unfortunately, things are not so simple.
When the file’s actual character encoding does not match the system’s default encoding, this misinterpretation might cause certain bytes to get classified as invalid, which leads to unspecified behavior: these “characters” might get filtered out or replaced by a replacement character, which can put garbage or even invalid characters (according to the real encoding) into the target file.
As Andreas correctly pointed out, the first character “
has been copied without damage, but is displayed incorrectly because whatever tool you used to open the file misinterpreted the contents, again as Windows-1252
. However, some of the other characters seem to be irreversibly damaged (but this could also be a result of copying them to this website)…
You may either use the constructors
new InputStreamReader(new FileInputStream("tweets.json"), StandardCharsets.UTF_8)
and
new OutputStreamWriter(new FileOutputStream(tmp), StandardCharsets.UTF_8)
to interpret a UTF-8
file correctly or, better, just copy the file without interpreting its contents:
Files.copy(Paths.get("tweets.json"), Paths.get("out.txt"));
or, if you really want to do the copying loop manually:
try(FileChannel in  = FileChannel.open(Paths.get("tweets.json"), READ);
    FileChannel out = FileChannel.open(Paths.get("out.txt"), WRITE, CREATE, TRUNCATE_EXISTING)) {
    long size = in.size(), trans = out.transferFrom(in, 0, size);
    for(long p = trans; p < size && trans > 0; p += trans)
        trans = out.transferFrom(in, p, size - p);
}
(assuming you have an import static java.nio.file.StandardOpenOption.*;
)
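For completeness, here is what the decode/re-encode variant using the two constructors mentioned above could look like as a whole. This is just a sketch; the class and helper names (`Utf8Copy`, `copyUtf8`) are mine, not from the original code:

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

public class Utf8Copy {
    // Reads the source file as UTF-8 and writes it back out as UTF-8,
    // so the system's default encoding never gets involved.
    static void copyUtf8(File src, File dst) throws IOException {
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                 new FileInputStream(src), StandardCharsets.UTF_8));
             BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
                 new FileOutputStream(dst), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                out.write(line);
                out.newLine(); // readLine() strips the line terminator
            }
        }
    }

    public static void main(String[] args) throws IOException {
        copyUtf8(new File("tweets.json"), new File("out.txt"));
    }
}
```

Note that, unlike a raw byte copy, this variant normalizes line terminators to the platform default, which may change the file slightly even though the characters survive.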
If you copy the files this way, you ensure that no damage occurs. Then you can focus on using an editor that reads the copy with the right encoding.
Upvotes: 1
Reputation: 159260
The Unicode character “
(\u201c
) is encoded in UTF-8
as:
\xE2\x80\x9C
which in Windows-1252
looks like:
“
So your problem is not that the Java code isn't generating UTF-8
(because it is), but that whatever tool you use to view the file content is reading it as Windows-1252
.
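You can reproduce this mojibake directly in Java with a minimal sketch: encode the character as UTF-8, then deliberately decode those bytes as Windows-1252:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        String s = "\u201c"; // left double quotation mark “
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8); // 0xE2 0x80 0x9C

        // Misinterpret the UTF-8 bytes as Windows-1252, as the viewer does:
        String misread = new String(utf8, Charset.forName("windows-1252"));
        System.out.println(misread); // prints “  (â, €, œ)
    }
}
```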
If you use a program like Notepad++, you can change the encoding used by selecting the appropriate option on the Encoding
pull-down menu.
FYI: Windows-1252
/ ISO 8859-1
don't support smileys, so you can't use those encodings for this file.
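One way to verify this last point is to ask the charset itself whether it can encode a given character, via CharsetEncoder.canEncode (a small sketch):

```java
import java.nio.charset.Charset;

public class EmojiCheck {
    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        // 😍 (U+1F60D, the surrogate pair \uD83D\uDE0D) is outside what
        // a single-byte charset like Windows-1252 can represent:
        System.out.println(cp1252.newEncoder().canEncode("\uD83D\uDE0D")); // false
        // whereas “ (U+201C) has a Windows-1252 code point (0x93):
        System.out.println(cp1252.newEncoder().canEncode("\u201C"));       // true
    }
}
```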
Upvotes: 0