MxLDevs

Reputation: 19506

Converting an ANSI file with German characters to UTF8

I have downloaded some plain text files from a German website, but I am not sure what the encoding is. The files have no byte order mark (BOM). I am using a parser that assumes the files are encoded in UTF8, so it does not handle certain accented characters properly (those that fall in the byte range > 127).

I would like to convert it to UTF8 but I am not sure if I will need to know the encoding to properly do this.

Others have been handling these files by manually opening each one in Windows Notepad and re-saving it as UTF8. This process preserves the accented characters, so I would like to automate the conversion, if possible without resorting to Windows Notepad.

How does Windows Notepad know how to convert it to UTF8 properly?
How should I convert the file to UTF8 (in Java 6)?

Upvotes: 1

Views: 5196

Answers (1)

Joop Eggen

Reputation: 109547

In Java 7, read the text as "Windows-1252", which is Windows Latin-1:

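import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

Path oldPath = Paths.get("C:/Temp/old.txt");
Path newPath = Paths.get("C:/Temp/new.txt");
byte[] bytes = Files.readAllBytes(oldPath);
// Decode as Windows-1252 and prepend a BOM (U+FEFF) for Notepad.
String content = "\uFEFF" + new String(bytes, "Windows-1252");
bytes = content.getBytes("UTF-8");
// With no options given, the file is created, or truncated if it exists.
Files.write(newPath, bytes);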

This takes the bytes and interprets them as Windows Latin-1. And for Notepad, the trick: Notepad recognizes the encoding by a leading BOM character (U+FEFF, a zero-width no-break space), which UTF-8 does not need and normally does not use.

The String is then encoded to bytes as UTF-8.
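To see what the BOM looks like on disk, here is a small sketch that prints the UTF-8 bytes of a string as hex. The class name BomDemo and the helper utf8Hex are made up for this illustration; only the charset calls are standard Java.

```java
import java.nio.charset.Charset;

public class BomDemo {
    /** Returns the UTF-8 encoding of the text as space-separated hex bytes. */
    static String utf8Hex(String text) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(Charset.forName("UTF-8"))) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask to 0..255 so negative bytes print as two hex digits.
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+FEFF alone: the three-byte UTF-8 BOM.
        System.out.println(utf8Hex("\uFEFF"));       // EF BB BF
        // BOM followed by "ü" (U+00FC), which takes two bytes in UTF-8.
        System.out.println(utf8Hex("\uFEFF\u00FC")); // EF BB BF C3 BC
    }
}
```

So a file written by the code above starts with the bytes EF BB BF. If the parser that consumes the file does not skip a BOM, leave out the "\uFEFF" prefix.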

Windows-1252 is ISO-8859-1 (pure Latin-1) plus some extra printable characters, like comma-shaped quotes, in the range 0x80 - 0x9F, where ISO-8859-1 only has control characters.
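A quick way to check the difference between the two charsets is to decode the same byte with both. This is only a sketch; the class name Cp1252VsLatin1 and the helper decode are invented for the example.

```java
import java.nio.charset.Charset;

public class Cp1252VsLatin1 {
    /** Decodes the bytes with the named charset. */
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        byte[] umlaut = { (byte) 0xFC }; // "ü" in both charsets
        byte[] quote = { (byte) 0x84 };  // German opening quote in Windows-1252

        // Same result for the umlaut: U+00FC either way.
        System.out.println((int) decode(umlaut, "windows-1252").charAt(0)); // 252
        System.out.println((int) decode(umlaut, "ISO-8859-1").charAt(0));   // 252

        // Different result for 0x84: U+201E versus the control character U+0084.
        System.out.println((int) decode(quote, "windows-1252").charAt(0));  // 8222
        System.out.println((int) decode(quote, "ISO-8859-1").charAt(0));    // 132
    }
}
```

For German text the umlauts and ß come out the same with either charset, so the choice only matters if the files contain characters such as the low opening quote or the euro sign.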


In Java 6:

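import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.OutputStream;

File oldPath = new File("C:/Temp/old.txt");
File newPath = new File("C:/Temp/new.txt");
long longLength = oldPath.length();
if (longLength > Integer.MAX_VALUE) {
    throw new IllegalArgumentException("File too large: " + oldPath.getPath());
}
byte[] bytes = new byte[(int) longLength];
// A single read() may return fewer bytes than requested;
// readFully keeps reading until the array is full.
DataInputStream in = new DataInputStream(new FileInputStream(oldPath));
try {
    in.readFully(bytes);
} finally {
    in.close();
}

// Decode as Windows-1252, prepend a BOM, and re-encode as UTF-8.
String content = "\uFEFF" + new String(bytes, "Windows-1252");
bytes = content.getBytes("UTF-8");

OutputStream out = new FileOutputStream(newPath);
try {
    out.write(bytes);
} finally {
    out.close();
}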
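If some of the files might already be UTF-8, blindly decoding them as Windows-1252 would garble them. One possible guard, sketched here as an assumption and not as what Notepad actually does, is to try a strict UTF-8 decode first and fall back to Windows-1252 only when that fails. The class name EncodingGuess and the method decodeGuessing are hypothetical; the java.nio.charset calls exist in Java 6.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

public class EncodingGuess {
    /**
     * Decodes the bytes as UTF-8 if they form valid UTF-8,
     * otherwise as Windows-1252 (assumed fallback for these files).
     */
    static String decodeGuessing(byte[] bytes) {
        try {
            // REPORT makes the decoder throw instead of inserting U+FFFD.
            return Charset.forName("UTF-8").newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8: treat it as Windows-1252.
            return new String(bytes, Charset.forName("windows-1252"));
        }
    }

    public static void main(String[] args) {
        // "Grüße" stored as Windows-1252 and as UTF-8.
        byte[] cp1252 = { 0x47, 0x72, (byte) 0xFC, (byte) 0xDF, 0x65 };
        byte[] utf8 = { 0x47, 0x72, (byte) 0xC3, (byte) 0xBC,
                        (byte) 0xC3, (byte) 0x9F, 0x65 };
        // Both inputs decode to the same five-character string.
        System.out.println(decodeGuessing(cp1252).equals(decodeGuessing(utf8)));
    }
}
```

The result of decodeGuessing can replace the new String(bytes, "Windows-1252") call above. The check is a heuristic: accented Windows-1252 text is almost never valid UTF-8 by accident, but pure ASCII files pass as UTF-8, which is harmless because the two encodings agree on ASCII.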

Upvotes: 2
