DanielBK

Reputation: 942

Why does RandomAccessFile read ï»¿ as first character in my UTF-8 text file?

A question on reading text files in Java. I have a text file saved with UTF-8 encoding with only the content:

Hello. World.

Now I am using a RandomAccessFile to read this file. But for some reason, there seems to be an "invisible" character at the beginning of the file ...?

I use this code:

File file = new File("resources/texts/books/testfile2.txt");
try(RandomAccessFile reader = new RandomAccessFile(file, "r")) {

    String readLine = reader.readLine();
    String utf8Line = new String(readLine.getBytes("ISO-8859-1"), "UTF-8" );
    System.out.println("Read Line: " + readLine);
    System.out.println("Real length: " + readLine.length());
    System.out.println("UTF-8 Line: " + utf8Line);
    System.out.println("UTF-8 length: " + utf8Line.length());
    System.out.println("Current position: " + reader.getFilePointer());
} catch (Exception e) {
    e.printStackTrace();
}

The output is this:

Read Line: ?»?Hello. World.
Real length: 16
UTF-8 Line: ?Hello. World.
UTF-8 length: 14
Current position: 16

These (1 or 2) characters seem to appear only at the very beginning. If I add more lines to the file and read them, all subsequent lines are read normally. Can someone explain this behavior? What is this character at the beginning?

Thanks!

Upvotes: 2

Views: 342

Answers (1)

MarianD

Reputation: 14151

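The first 3 bytes in your file (0xef, 0xbb, 0xbf) are the so-called UTF-8 BOM (Byte Order Mark). A BOM is important for UTF-16 and UTF-32 only, where it indicates the byte order - for UTF-8 it has no such meaning. Microsoft tools write it as a signature to make the file encoding easier to guess. RandomAccessFile.readLine() turns every byte into one character, so you first see three characters (ï»¿); after re-decoding the line as UTF-8 they collapse into the single invisible character U+FEFF, which is why your lengths are 16 and 14 instead of 13.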

That is, not all UTF-8 encoded text files have that mark, but some do.
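
Since the mark may or may not be present, one way to handle it is to check for it after decoding and drop it only if it is there. The following is a minimal sketch, not a definitive fix. It assumes the same file path as in the question and relies on RandomAccessFile.readLine() mapping each byte to one character, so the line can be re-decoded through ISO-8859-1. The class name BomAwareReader and the helper methods decodeUtf8 and stripBom are hypothetical names chosen for illustration.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration; not part of any library.
final class BomAwareReader {
    // The UTF-8 BOM bytes (0xEF, 0xBB, 0xBF) decode to this single character.
    private static final String BOM = "\uFEFF";

    // readLine() stores each byte in its own char, so getting the bytes back
    // via ISO-8859-1 restores the original byte sequence for UTF-8 decoding.
    static String decodeUtf8(String rawLine) {
        return new String(rawLine.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    // Remove the BOM only if the line actually starts with it.
    static String stripBom(String line) {
        return line.startsWith(BOM) ? line.substring(BOM.length()) : line;
    }

    public static void main(String[] args) throws IOException {
        // Path taken from the question; adjust it to your own file.
        try (RandomAccessFile reader = new RandomAccessFile("resources/texts/books/testfile2.txt", "r")) {
            String raw = reader.readLine();
            if (raw != null) {
                String line = stripBom(decodeUtf8(raw));
                System.out.println("Line: " + line);
                System.out.println("Length: " + line.length());
            }
        }
    }
}
```

Only the first line of a file can carry the mark, so stripBom needs to be applied just once, right after the first readLine() call. The file pointer still counts the three BOM bytes, which matters if you store positions for later seek() calls.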

Upvotes: 3
