Reputation: 127
I'm reading a file which contains Unicode escape sequences among the text; here's an example:
\u201c@hannah_hartzler: In line for the gate keeper! @nerk97 @ShannonWalkup\u201d\ud83d\ude0d\ud83d\ude0d\ud83d\ude0d\ud83d\ude0d\u2764\u2764\u2764
When I read it with a BufferedReader
and write it back to another file with a FileWriter
the text becomes this:
“@hannah_hartzler: In line for the gate keeper! @nerk97 @ShannonWalkupâ€ðŸ˜ðŸ˜ðŸ˜ðŸ˜â¤â¤â¤
due to the UTF-8 encoding, but what I want to have is:
“@hannah_hartzler: In line for the gate keeper! @nerk97 @ShannonWalkup”😍😍😍😍❤❤❤
My question is: how do I read and write the lines of text correctly, so that the right characters get printed?
I don't modify the lines of text; it's just a problem of conversion between Unicode and UTF-8. Here's my code:
FileReader fileReader = new FileReader("tweets.json");
BufferedReader bufferedReader = new BufferedReader(fileReader);
File tmp = new File("out.txt");
FileWriter fileWriter = new FileWriter(tmp);
BufferedWriter bw = new BufferedWriter(fileWriter);
...
String line = bufferedReader.readLine();
bw.write(line);
Upvotes: 0
Views: 1027
Reputation: 298579
When you open a file via new FileReader("tweets.json")
, its contents get interpreted using the system’s default encoding. When you open the target file via new BufferedWriter(fileWriter)
, the characters get encoded using the system’s default encoding again. This might look like the file gets copied as-is, but unfortunately, things are not so simple.
When the file’s actual character encoding does not match the system’s default encoding, this misinterpretation might cause certain bytes to get classified as invalid, which leads to unspecified behavior: these “characters” might get filtered out or replaced by a replacement character, which can put garbage or even invalid characters (according to the real encoding) into the target file.
As Andreas correctly pointed out, the first character “
has been copied without damage, but is displayed incorrectly because whatever tool you used to open the file misinterpreted the contents, again as Windows-1252
. However, some of the other characters seem to be irreversibly damaged (but this could also be a result of copying them to this website)…
You may either use the constructors
new InputStreamReader(new FileInputStream("tweets.json"), StandardCharsets.UTF_8)
and
new OutputStreamWriter(new FileOutputStream(tmp), StandardCharsets.UTF_8)
to interpret a UTF-8
file correctly or, better, just copy the file without interpreting its contents:
Files.copy(Paths.get("tweets.json"), Paths.get("out.txt"));
or, if you really want to do the copying loop manually:
try(FileChannel in  = FileChannel.open(Paths.get("tweets.json"), READ);
    FileChannel out = FileChannel.open(Paths.get("out.txt"), WRITE, CREATE, TRUNCATE_EXISTING)) {
    long size = in.size(), trans = out.transferFrom(in, 0, size);
    for(long p = trans; p < size && trans > 0; p += trans)
        trans = out.transferFrom(in, p, size - p);
}
(assuming you have an import static java.nio.file.StandardOpenOption.*;
)
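For completeness, here is what the decode/re-encode variant using the two constructors mentioned above could look like as a whole. This is just a sketch; the class and helper names (`Utf8Copy`, `copyUtf8`) are mine, not from the original code:

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

public class Utf8Copy {
    // Reads the source file as UTF-8 and writes it back out as UTF-8,
    // so the system's default encoding never gets involved.
    static void copyUtf8(File src, File dst) throws IOException {
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                 new FileInputStream(src), StandardCharsets.UTF_8));
             BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
                 new FileOutputStream(dst), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                out.write(line);
                out.newLine(); // readLine() strips the line terminator
            }
        }
    }

    public static void main(String[] args) throws IOException {
        copyUtf8(new File("tweets.json"), new File("out.txt"));
    }
}
```

Note that, unlike a raw byte copy, this variant normalizes line terminators to the platform default, which may change the file slightly even though the characters survive.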
If you copy the files this way, you ensure that no damage occurs. Then you can focus on using an editor that reads the copy with the right encoding.
Upvotes: 1
Reputation: 159260
The Unicode character “
(\u201c
) is encoded in UTF-8
as:
\xE2\x80\x9C
which in Windows-1252
looks like:
“
So your problem is not that the Java code isn't generating UTF-8
(because it is), but that whatever tool you use to view the file content is reading it as Windows-1252
.
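You can reproduce this mojibake directly in Java with a minimal sketch: encode the character as UTF-8, then deliberately decode those bytes as Windows-1252:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        String s = "\u201c"; // left double quotation mark “
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8); // 0xE2 0x80 0x9C

        // Misinterpret the UTF-8 bytes as Windows-1252, as the viewer does:
        String misread = new String(utf8, Charset.forName("windows-1252"));
        System.out.println(misread); // prints “  (â, €, œ)
    }
}
```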
If you use a program like Notepad++, you can change the encoding used by selecting the appropriate option on the Encoding
pull-down menu.
FYI: Windows-1252
/ ISO 8859-1
don't support smileys, so you can't use those encodings for this file.
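One way to verify this last point is to ask the charset itself whether it can encode a given character, via CharsetEncoder.canEncode (a small sketch):

```java
import java.nio.charset.Charset;

public class EmojiCheck {
    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        // 😍 (U+1F60D, the surrogate pair \uD83D\uDE0D) is outside what
        // a single-byte charset like Windows-1252 can represent:
        System.out.println(cp1252.newEncoder().canEncode("\uD83D\uDE0D")); // false
        // whereas “ (U+201C) has a Windows-1252 code point (0x93):
        System.out.println(cp1252.newEncoder().canEncode("\u201C"));       // true
    }
}
```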
Upvotes: 0