BestPractices

Reputation: 12876

How do I read files in different character encodings into a Java app that stores UTF-8?

My application is set up to store text as UTF-8. I read files from various other organizations, and these might be in UTF-8, Latin-1, ASCII, etc. Do I need to do anything special to ensure that files in these different encodings are converted to UTF-8 correctly? For example, do I need to figure out which character encoding each file uses and explicitly convert it to UTF-8?

Or is the following sufficient?

Reader reader = new InputStreamReader(new FileInputStream("c:/file.txt"), "UTF-8");

Upvotes: 0

Views: 967

Answers (2)

Sebastian Negraszus

Reputation: 12215

You need to tell the reader the encoding of the file.

If your input can be in many different encodings, you might have a problem: an encoding cannot be detected reliably. See "How can I detect the encoding/codepage of a text file".

When you want to support different encodings, you basically have three options:

  • Store information about the encoding somewhere, such as <?xml version="1.0" encoding="UTF-8" ?> in XML files. Unfortunately, not all file formats – such as "plain text" files – have such meta data.
  • "Detect"/guess the encoding with various heuristics. This might sometimes go wrong.
  • Ask the user. This is terrible user experience, because most users have absolutely no clue what encodings even are.
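As a rough illustration of the second option, here is a minimal sketch of one common heuristic: try a strict UTF-8 decode, and if the bytes are not valid UTF-8, fall back to a charset you assume the sender used. The class name EncodingGuess, the method names, and the choice of fallback are hypothetical, not part of any library. Note that pure ASCII input passes as UTF-8 (which is harmless), and that the fallback is still only a guess.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class EncodingGuess {

    // Returns UTF-8 if the bytes decode strictly as UTF-8,
    // otherwise the caller-supplied fallback (an assumption about the source).
    static Charset guessCharset(byte[] bytes, Charset fallback) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)       // fail instead of substituting U+FFFD
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return StandardCharsets.UTF_8;
        } catch (CharacterCodingException e) {
            return fallback;
        }
    }

    // Decodes the bytes into a String using the guessed charset.
    static String decode(byte[] bytes, Charset fallback) {
        return new String(bytes, guessCharset(bytes, fallback));
    }
}
```

For example, EncodingGuess.decode(bytes, StandardCharsets.ISO_8859_1) treats anything that is not valid UTF-8 as Latin-1. Since every byte sequence is valid Latin-1, this never fails, but it can silently produce wrong characters if the file was actually in some other encoding.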

Upvotes: 2

jtahlborn

Reputation: 53694

You have that wrong. You don't read into an encoding, you read from an encoding. The encoding you provide as the second argument to InputStreamReader should be the expected encoding of the source stream (the file).

Reader reader = new InputStreamReader(new FileInputStream("c:/file.txt"), "<encoding_of_file.txt>");

Once the data is in memory (as Java chars and Strings), it is always UTF-16. When you want to write the data out (assuming you always want to write it as UTF-8), use:

Writer writer = new OutputStreamWriter(new FileOutputStream("destfile"), "UTF-8");
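Putting the two halves together, a minimal sketch of the whole round trip might look like the following. It assumes you already know the source file's encoding and pass it in. The class name Transcode and the method toUtf8 are hypothetical, and the java.nio.file.Files helpers are used here in place of the stream constructors above.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class, for illustration only.
final class Transcode {

    // Reads 'source' using its actual encoding and writes the same text to 'dest' as UTF-8.
    static void toUtf8(Path source, Charset sourceCharset, Path dest) throws IOException {
        // try-with-resources closes both streams even if an exception is thrown
        try (Reader reader = Files.newBufferedReader(source, sourceCharset);
             Writer writer = Files.newBufferedWriter(dest, StandardCharsets.UTF_8)) {
            char[] buffer = new char[8192];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                writer.write(buffer, 0, n); // chars are UTF-16 in memory, encoded as UTF-8 on write
            }
        }
    }
}
```

A call such as Transcode.toUtf8(Paths.get("c:/file.txt"), StandardCharsets.ISO_8859_1, Paths.get("destfile")) would convert a Latin-1 file. One thing to be aware of: the reader returned by Files.newBufferedReader throws an exception on malformed input rather than substituting replacement characters, so passing the wrong source charset can fail loudly instead of silently corrupting text.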

Upvotes: 6
