Reputation: 19506
I have downloaded some plain text files from a German website, but I am not sure what the encoding is. The files contain no byte order mark. I am using a parser that assumes the files are encoded in UTF-8, so it does not handle certain accented characters properly (those with byte values above 127).
I would like to convert the files to UTF-8, but I am not sure whether I need to know the source encoding to do this properly.
Others have been handling these files by manually opening each one in Windows Notepad and re-saving it as UTF-8. This process preserves the accented characters, so I would like to automate the conversion, if possible without resorting to Windows Notepad.
How does Windows Notepad know how to convert it to UTF-8 properly?
How should I convert the file to UTF-8 (in Java 6)?
Upvotes: 1
Views: 5196
Reputation: 109547
In Java 7, read the text as "Windows-1252", which is Windows Latin-1:
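import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

Path oldPath = Paths.get("C:/Temp/old.txt");
Path newPath = Paths.get("C:/Temp/new.txt");
byte[] bytes = Files.readAllBytes(oldPath);
// Decode as Windows-1252 and prepend a BOM (U+FEFF) so that Notepad detects UTF-8.
String content = "\uFEFF" + new String(bytes, "Windows-1252");
bytes = content.getBytes("UTF-8");
// With no options, Files.write creates the file or truncates an existing one.
Files.write(newPath, bytes);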
This takes the bytes and interprets them as Windows Latin-1. The trick for Notepad is the leading BOM: Notepad recognizes a file as UTF-8 by a preceding BOM character (U+FEFF, a zero-width no-break space), which is optional in UTF-8 and normally not used there.
The String is then encoded as UTF-8.
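To make the byte-level effect concrete, here is a minimal sketch of the same decode and re-encode step on an in-memory array. The class and method names (`TranscodeDemo`, `toUtf8`, `hex`) are made up for illustration, and it assumes the input really is Windows-1252. It only uses calls that exist in Java 6.

```java
import java.nio.charset.Charset;

class TranscodeDemo {
    private static final Charset CP1252 = Charset.forName("windows-1252");
    private static final Charset UTF8 = Charset.forName("UTF-8");

    /** Decodes Windows-1252 bytes and re-encodes them as UTF-8, optionally with a leading BOM. */
    static byte[] toUtf8(byte[] cp1252Bytes, boolean withBom) {
        String text = new String(cp1252Bytes, CP1252);
        if (withBom) {
            text = "\uFEFF" + text; // U+FEFF becomes the bytes EF BB BF in UTF-8
        }
        return text.getBytes(UTF8);
    }

    /** Formats bytes as space-separated upper-case hex pairs. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        byte[] input = { 'K', (byte) 0xE4, 's', 'e' }; // "Käse" in Windows-1252
        System.out.println(hex(toUtf8(input, false))); // 4B C3 A4 73 65
        System.out.println(hex(toUtf8(input, true)));  // EF BB BF 4B C3 A4 73 65
    }
}
```

The single byte 0xE4 turns into the two-byte sequence C3 A4, which is what a UTF-8 parser expects. Whether to write the BOM depends on the consumer: Notepad uses it, but some UTF-8 parsers treat it as an unexpected leading character.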
Windows-1252 is ISO-8859-1 (pure Latin-1) but additionally defines some special characters, such as typographic (comma-shaped) quotes, in the range 0x80 - 0x9F.
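The difference only shows up in that range, which the following small sketch illustrates. The class name `Latin1VersusCp1252` and its helpers are hypothetical; the byte 0x84 is the German low opening quote in Windows-1252, while Java's ISO-8859-1 decoder maps it to a C1 control character.

```java
import java.nio.charset.Charset;

class Latin1VersusCp1252 {
    /** Decodes the bytes with the named charset. */
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    /** Formats the first character of the text as U+XXXX. */
    static String codePoint(String text) {
        return String.format("U+%04X", (int) text.charAt(0));
    }

    public static void main(String[] args) {
        byte[] lowQuote = { (byte) 0x84 };
        System.out.println(codePoint(decode(lowQuote, "windows-1252"))); // U+201E
        System.out.println(codePoint(decode(lowQuote, "ISO-8859-1")));   // U+0084

        // Outside that range both charsets agree, e.g. 0xE4 is U+00E4 (ä) in each.
        byte[] umlaut = { (byte) 0xE4 };
        System.out.println(codePoint(decode(umlaut, "windows-1252")));   // U+00E4
        System.out.println(codePoint(decode(umlaut, "ISO-8859-1")));     // U+00E4
    }
}
```

So for text that came from a Windows machine, decoding as Windows-1252 is usually the safer guess of the two, since it keeps the quotes and the euro sign intact.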
In Java 6:
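import java.io.*;
File oldPath = new File("C:/Temp/old.txt");
File newPath = new File("C:/Temp/new.txt");
long longLength = oldPath.length();
if (longLength > Integer.MAX_VALUE) {
    throw new IllegalArgumentException("File too large: " + oldPath.getPath());
}
int fileSize = (int) longLength;
byte[] bytes = new byte[fileSize];
InputStream in = new FileInputStream(oldPath);
int nread = 0;
try {
    // A single read() may return fewer bytes than requested, so loop until done.
    int n;
    while (nread < fileSize && (n = in.read(bytes, nread, fileSize - nread)) != -1) {
        nread += n;
    }
} finally {
    in.close();
}
if (nread != fileSize) {
    throw new IOException("Read only " + nread + " of " + fileSize + " bytes");
}
// Decode as Windows-1252 and prepend a BOM (U+FEFF) so that Notepad detects UTF-8.
String content = "\uFEFF" + new String(bytes, "Windows-1252");
bytes = content.getBytes("UTF-8");
OutputStream out = new FileOutputStream(newPath);
try {
    out.write(bytes);
} finally {
    out.close();
}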
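If the files could be large, a streaming variant avoids holding the whole file in memory. This is only a sketch under the same assumption that the source is Windows-1252; `StreamingTranscoder` and `convert` are names invented here, and unlike the code above it writes no BOM.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;

class StreamingTranscoder {
    /** Copies source to target, decoding with sourceCharset and encoding as UTF-8 (no BOM). */
    static void convert(File source, File target, String sourceCharset) throws IOException {
        // Resolve both charsets first so that a bad name fails before any file is opened.
        Charset from = Charset.forName(sourceCharset);
        Charset to = Charset.forName("UTF-8");

        Reader reader = new InputStreamReader(new FileInputStream(source), from);
        try {
            Writer writer = new OutputStreamWriter(new FileOutputStream(target), to);
            try {
                char[] buffer = new char[8192];
                int n;
                while ((n = reader.read(buffer)) != -1) {
                    writer.write(buffer, 0, n);
                }
            } finally {
                writer.close(); // also flushes the remaining encoded bytes
            }
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws IOException {
        // Paths are the ones used in the answer above.
        convert(new File("C:/Temp/old.txt"), new File("C:/Temp/new.txt"), "windows-1252");
    }
}
```

To keep the Notepad-friendly BOM, one could call `writer.write('\uFEFF')` once before the copy loop.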
Upvotes: 2