Ron Tuffin

Reputation: 54619

Read unicode text files with java

Real simple question really. I need to read a Unicode text file in a Java program.

I am used to using plain ASCII text with a BufferedReader FileReader combo which is obviously not working :(

I know that I can read a String in the 'traditional' way using a Buffered Reader and then convert it using something like:

temp = new String(temp.getBytes(), "UTF-16");

But is there a way to wrap the Reader in a 'Converter'?

EDIT: the file starts with FF FE

Upvotes: 15

Views: 64579

Answers (7)

Macarse

Reputation: 93143

Check https://docs.oracle.com/javase/1.5.0/docs/api/java/io/InputStreamReader.html.

I would read the source file with something like:

Reader in = new InputStreamReader(new FileInputStream("file"), "UTF-8");

Upvotes: 10

Jorge Ros

Reputation: 21

I just had to add "UTF-8" when creating the InputStreamReader, and the special characters showed up immediately.

InputStreamReader istreamReader = new InputStreamReader(inputStream,"UTF-8");
BufferedReader bufferedReader = new BufferedReader(istreamReader);

Upvotes: 0

aldo

Reputation: 1

String s = new String(Files.readAllBytes(Paths.get("file.txt")),"UTF-8");

Upvotes: -1

stenix

Reputation: 3106

I would recommend using UnicodeReader from the Google Data API; see this answer to a similar question. It automatically detects the encoding from the byte order mark (BOM).

You may also consider BOMInputStream in Apache Commons IO, which does basically the same thing but does not cover all alternative versions of the BOM.
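To show roughly what BOM detection involves, here is a minimal hand-rolled sketch using only the JDK. It is not the implementation of UnicodeReader or BOMInputStream; the class name BomSniffer and the method open are made up for illustration, and it only recognises the UTF-8 and UTF-16 marks (UTF-32 is deliberately ignored).

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library mentioned above.
class BomSniffer {
    // Returns a Reader whose charset is chosen from the BOM, if one is present.
    // Falls back to the given charset when no known BOM is found.
    static Reader open(InputStream raw, Charset fallback) throws IOException {
        PushbackInputStream in = new PushbackInputStream(raw, 3);
        byte[] head = new byte[3];

        // Read up to three bytes; a single read() call may return fewer.
        int n = 0;
        while (n < head.length) {
            int r = in.read(head, n, head.length - n);
            if (r < 0) {
                break;
            }
            n += r;
        }

        Charset charset = fallback;
        int bomLength = 0;
        if (n >= 3 && (head[0] & 0xFF) == 0xEF && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            charset = StandardCharsets.UTF_8;
            bomLength = 3;
        } else if (n >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            charset = StandardCharsets.UTF_16LE;
            bomLength = 2;
        } else if (n >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            charset = StandardCharsets.UTF_16BE;
            bomLength = 2;
        }

        // Push back whatever was read beyond the BOM so the decoder sees it.
        if (n > bomLength) {
            in.unread(head, bomLength, n - bomLength);
        }
        return new InputStreamReader(in, charset);
    }
}
```

You could call it as BomSniffer.open(new FileInputStream("file"), StandardCharsets.UTF_8) and wrap the result in a BufferedReader. For anything beyond a quick experiment, one of the libraries above is probably the safer choice.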

Upvotes: 2

daniel molla

Reputation: 1

// Pass the charset explicitly; otherwise Scanner uses the platform default.
Scanner scan = new Scanner(new File("C:\\Users\\daniel\\Desktop\\Corpus.txt"), "UTF-16");
while (scan.hasNextLine()) {
    System.out.println(scan.nextLine());
}

Upvotes: -1

McDowell

Reputation: 108889

Some notes:

  • the "UTF-16" encoding can read either little- or big-endian files marked with a BOM (see here for a list of Java 6 encodings). It is not explicitly stated which endianness is used when writing with "UTF-16" (it appears to be big-endian), so you might want to use "UnicodeLittle" when saving the data
  • be careful when using String class encode/decode methods, especially with a marked variable-width encoding like UTF-16 - use them only on whole data
  • as others have said, it is often best to read character data by wrapping your InputStream with an InputStreamReader; you can concatenate your input into a single String using a StringBuilder or similar buffer.
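A small sketch of the first note, assuming a reasonably recent JDK: decoding with the generic UTF-16 charset honours whichever BOM is present, while a little-endian file with an FF FE mark can be produced by writing the BOM by hand and then encoding with UTF-16LE. The class and method names here are invented for the example.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper names, for illustration only.
class Utf16Endianness {
    // Decodes BOM-marked bytes; the BOM decides the byte order and is not
    // included in the returned String.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16);
    }

    // Encodes text as little-endian UTF-16 preceded by an explicit FF FE BOM,
    // which matches the file described in the question.
    static byte[] encodeLittleEndianWithBom(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(0xFF);
        out.write(0xFE);
        byte[] body = text.getBytes(StandardCharsets.UTF_16LE);
        out.write(body, 0, body.length);
        return out.toByteArray();
    }
}
```

Writing the BOM manually avoids depending on the "UnicodeLittle" alias being available in a given runtime.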

Upvotes: 7

objects

Reputation: 8677

You wouldn't wrap the Reader; instead, you would wrap the stream using an InputStreamReader. You could then wrap that with the BufferedReader you currently use:

BufferedReader in = new BufferedReader(new InputStreamReader(stream, encoding));
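Put together for the file in the question (which starts with FF FE, so it is UTF-16 with a BOM), the line above might be used as in the sketch below. The class name Utf16LineReader and the method readLines are placeholders chosen for this example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
class Utf16LineReader {
    // Reads every line of the file, decoding the bytes as UTF-16.
    // The decoder uses the BOM to pick the byte order and drops it from the text.
    static List<String> readLines(Path file) throws IOException {
        List<String> lines = new ArrayList<>();
        try (InputStream stream = Files.newInputStream(file);
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(stream, StandardCharsets.UTF_16))) {
            String line;
            while ((line = in.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }
}
```

The try-with-resources block closes the reader and the underlying stream even if reading fails partway through.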

Upvotes: 18
