Reputation: 4547
I have a text file with a strange encoding, "UCS-2 Little Endian", whose contents I want to read using Java.
As you can see in the screenshot above, the file contents appear fine in Notepad++, but when I read the file with the following code, only garbage is printed to the console:
String textFilePath = "c:\\strange_file_encoding.txt";
BufferedReader reader = new BufferedReader( new InputStreamReader( new FileInputStream( textFilePath ), "UTF8" ) );
String line = "";
while ( ( line = reader.readLine() ) != null ) {
    System.out.println( line ); // Prints garbage characters
}
The main point is that the user selects the file to read, so it can be in any encoding. Since I can't detect the file encoding, I decode it using "UTF8", but as the example above shows, that fails to read the file correctly.
Is there a way to read such files correctly? Or can I at least detect when my code will fail to read a file correctly?
Upvotes: 7
Views: 12660
Reputation: 20376
You cannot use UTF-8 encoding for all files, especially if you do not know which encoding to expect. Use a library that can detect the file encoding before you read the file, for example juniversalchardet or jChardet.
For more info see Java : How to determine the correct charset encoding of a stream
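A minimal detection sketch with juniversalchardet (the detector API below is the one documented in that library's README; the file path is only a placeholder):
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import org.mozilla.universalchardet.UniversalDetector;

public class DetectThenRead {
    public static void main( String[] args ) throws Exception {
        String filePath = "c:\\strange_file_encoding.txt"; // placeholder path

        // 1. Feed the raw bytes of the file to the detector.
        UniversalDetector detector = new UniversalDetector( null );
        byte[] buf = new byte[ 4096 ];
        try ( FileInputStream fis = new FileInputStream( filePath ) ) {
            int nread;
            while ( ( nread = fis.read( buf ) ) > 0 && !detector.isDone() ) {
                detector.handleData( buf, 0, nread );
            }
        }
        detector.dataEnd();

        // 2. Fall back to UTF-8 if detection was inconclusive.
        String encoding = detector.getDetectedCharset();
        if ( encoding == null ) {
            encoding = "UTF-8";
        }

        // 3. Re-open the file and decode it with the detected charset.
        try ( BufferedReader reader = new BufferedReader(
                new InputStreamReader( new FileInputStream( filePath ), encoding ) ) ) {
            String line;
            while ( ( line = reader.readLine() ) != null ) {
                System.out.println( line );
            }
        }
    }
}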
Upvotes: 1
Reputation: 387
You are passing UTF-8 as the encoding to the InputStreamReader constructor, so it will try to interpret the bytes as UTF-8 instead of UCS-2 Little Endian. Here is the documentation: Charset
According to it, I suppose you need to use UTF-16LE.
Here is more info on the supported character sets and their Java names: Supported Encodings
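For illustration, a tiny sketch of that lookup, using nothing beyond the standard Charset API (UTF-16LE is one of the charsets every JVM is required to support, so the lookup cannot fail):
import java.nio.charset.Charset;

public class CharsetLookup {
    public static void main( String[] args ) {
        // "UTF-16LE" is the canonical java.nio name listed on the Supported
        // Encodings page; it covers files Notepad++ labels "UCS-2 Little Endian".
        Charset utf16le = Charset.forName( "UTF-16LE" );
        System.out.println( utf16le.name() );                    // prints UTF-16LE
        System.out.println( Charset.isSupported( "UTF-16LE" ) ); // prints true (required charset)
    }
}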
Upvotes: 7
Reputation: 95518
You're providing the wrong encoding in InputStreamReader. Have you tried using UTF-16LE instead of UTF8?
BufferedReader reader = new BufferedReader( new InputStreamReader( new FileInputStream( filePath ), "UTF-16LE" ) );
According to Charset:
UTF-16LE: Sixteen-bit UCS Transformation Format, little-endian byte order
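Putting that together with the loop from the question, a complete sketch (try-with-resources added; the path is just the placeholder from the question):
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;

public class ReadUcs2LeFile {
    public static void main( String[] args ) throws Exception {
        String filePath = "c:\\strange_file_encoding.txt";
        try ( BufferedReader reader = new BufferedReader(
                new InputStreamReader( new FileInputStream( filePath ), "UTF-16LE" ) ) ) {
            String line;
            while ( ( line = reader.readLine() ) != null ) {
                System.out.println( line ); // now prints readable text instead of garbage
            }
        }
    }
}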
Upvotes: 1