Roman Shmandrovskyi

Reputation: 973

How to read a file while preserving its encoding?

I have a file in ISO-8859-1 encoding. I do the following:

InputStreamReader isr = new InputStreamReader(new FileInputStream(fileLocation));
System.out.println(isr.getEncoding());

And I get UTF8. It looks like FileInputStream or InputStreamReader converts it to UTF-8.

Yes, I know I can pass the encoding explicitly:

// The second constructor argument names the charset used to decode the bytes.
BufferedReader br = new BufferedReader(
     new InputStreamReader(
     new FileInputStream(fileLocation), "ISO-8859-1"));

But I don't know in advance what encoding my file will have.

How can I read a file while preserving its encoding?

Upvotes: 1

Views: 114

Answers (1)

Joop Eggen

Reputation: 109547

A text file is just a sequence of bytes that represents text in some encoding (charset). Unfortunately, the file itself generally does not record which encoding that is.

Sometimes the encoding is indicated somewhere: Unicode text can have an optional BOM (byte order mark) at the beginning of the file, and HTML and XML can declare the charset in the document.
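
To illustrate the BOM case, here is a minimal sketch that checks the first bytes of a file against the well-known UTF-8 and UTF-16 byte order marks. The class and method names (BomSniffer, detectBom) are made up for this example, and UTF-32 BOMs are deliberately not handled.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the JDK.
final class BomSniffer {

    /**
     * Returns the charset indicated by a leading BOM, or null if the
     * bytes do not start with a UTF-8 or UTF-16 BOM.
     * Note: UTF-32 BOMs are not checked in this sketch.
     */
    static Charset detectBom(byte[] b) {
        if (b.length >= 3
                && (b[0] & 0xFF) == 0xEF
                && (b[1] & 0xFF) == 0xBB
                && (b[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;      // EF BB BF
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            return StandardCharsets.UTF_16BE;   // FE FF
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            return StandardCharsets.UTF_16LE;   // FF FE
        }
        return null;                            // no BOM found
    }
}
```

A null result only means that no BOM is present; the file may still be UTF-8, since the BOM is optional.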

If you downloaded the file from the internet, the charset may be mentioned in the HTTP response headers. Say it is an HTML file served with Content-Type: text/html; charset=Windows-1251. Then you could read the file as Windows-1251 and always store it as UTF-8, adding or updating a <meta charset="UTF-8"> tag accordingly.
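
As a rough sketch of the header case, the charset parameter can be pulled out of a Content-Type value with a regular expression. The class and method names (ContentTypeCharset, fromContentType) and the fallback parameter are assumptions made for this example; a real HTTP client library may already expose the charset for you.

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the JDK.
final class ContentTypeCharset {

    // Matches charset=NAME or charset="NAME", case-insensitively.
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the charset named in the header value, or the fallback if absent or unknown. */
    static Charset fromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        Matcher m = CHARSET.matcher(contentType);
        if (!m.find()) {
            return fallback;
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names.
            return fallback;
        }
    }
}
```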

But in general there is no reliable way to determine an arbitrary file's encoding. As a heuristic, you could:

  • read the bytes
  • if they decode as UTF-8 without errors in the multibyte sequences, treat the file as UTF-8
  • otherwise assume a single-byte encoding and default to Windows-1252 (rather than ISO-8859-1)
  • optionally use word frequency tables for some languages, together with their typical encodings, and try those
  • decode the bytes with the determined encoding and write the text back to a file as UTF-8
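
The core of these steps (strict UTF-8 check, Windows-1252 fallback, rewrite as UTF-8) can be sketched as follows. This is only an illustration under stated assumptions: the names CharsetGuess, decode and rewriteAsUtf8 are made up, the language-frequency step is left out, and it assumes the JDK in use provides the windows-1252 charset (common JDKs do, but only a handful of charsets such as UTF-8 and ISO-8859-1 are guaranteed).

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of the JDK.
final class CharsetGuess {

    /** Decodes as UTF-8 if the bytes are valid UTF-8, otherwise as Windows-1252. */
    static String decode(byte[] bytes) {
        try {
            // REPORT makes the decoder throw instead of silently
            // substituting U+FFFD for malformed sequences.
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8: assume a single-byte encoding.
            return new String(bytes, Charset.forName("windows-1252"));
        }
    }

    /** Reads the source file, guesses its encoding, and writes it out as UTF-8. */
    static void rewriteAsUtf8(Path source, Path target) throws IOException {
        String text = decode(Files.readAllBytes(source));
        Files.write(target, text.getBytes(StandardCharsets.UTF_8));
    }
}
```

Pure ASCII input passes the UTF-8 check as well, which is harmless because ASCII decodes identically in both encodings. The fallback is still a guess: any single-byte encoding other than Windows-1252 will decode without error but yield wrong characters.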

There might be a library that does this, combining language recognition with charset recognition.

Upvotes: 2
