ChriX

Reputation: 961

RandomAccessFile and UTF-8 line

I use a RandomAccessFile object to read a UTF-8 encoded French file, one line at a time, with the readLine method.

My Groovy code is below:

while ((line = randomAccess.readLine()) != null) {
    def utfLine = new String(line.getBytes('UTF-8'), 'UTF-8')
    ++count
    long nextRecordPos = randomAccess.getFilePointer()

    compareNextRecords(utfLine, randomAccess)

    randomAccess.seek(nextRecordPos)
}

My problem is that utfLine and line are the same: the accented characters stay like Ã© instead of é. No conversion is done.

Upvotes: 1

Views: 639

Answers (1)

Nayuki

Reputation: 18532

First of all, this line of code does absolutely nothing. Encoding a string to UTF-8 bytes and immediately decoding those bytes as UTF-8 gives back the same string. Remove it:

def utfLine = new String(line.getBytes('UTF-8'), 'UTF-8')
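To see why that line is a no-op, here is a minimal Java sketch of the same round trip. The class and method names (RoundTripDemo, roundTrip) are made up for illustration and do not come from the question.

```java
import java.nio.charset.StandardCharsets;

class RoundTripDemo {
    // Encodes a string to UTF-8 bytes and decodes them again with the same charset.
    static String roundTrip(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // U+00C3 U+00A9 is what readLine() returns for a UTF-8 encoded e-acute
        String mangled = "\u00C3\u00A9";
        System.out.println(roundTrip(mangled).equals(mangled)); // prints "true"
    }
}
```

The mangled string goes in and the same mangled string comes out, which matches what the question describes.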

According to the Javadoc, RandomAccessFile.readLine() is not aware of character encodings. It reads bytes until it encounters "\r" or "\n" or "\r\n" (or the end of the file). ASCII byte values are put into the returned string in the normal way. But byte values between 128 and 255 are put into the string literally, without interpreting them as part of a character encoding: each byte becomes one char with the same numeric value (you could say this is the raw/verbatim encoding, which is effectively ISO-8859-1). So the two UTF-8 bytes of é (0xC3 0xA9) come back as the two characters Ã©.
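A small sketch of that behavior, assuming a throwaway temporary file. The class and method names (ReadLineBytesDemo, readFirstLineRaw) are hypothetical; only the RandomAccessFile, File and Files calls are real JDK APIs.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

class ReadLineBytesDemo {
    // Writes text as UTF-8 to a temporary file and returns what readLine() gives for the first line.
    static String readFirstLineRaw(String text) {
        try {
            File tmp = File.createTempFile("readline-demo", ".txt");
            try {
                Files.write(tmp.toPath(), text.getBytes(StandardCharsets.UTF_8));
                try (RandomAccessFile raf = new RandomAccessFile(tmp, "r")) {
                    return raf.readLine();
                }
            } finally {
                tmp.delete();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // U+00E9 (e-acute) is the two bytes 0xC3 0xA9 in UTF-8
        String raw = readFirstLineRaw("\u00E9\n");
        System.out.println(raw.length());        // prints 2
        System.out.println((int) raw.charAt(0)); // prints 195 (0xC3)
        System.out.println((int) raw.charAt(1)); // prints 169 (0xA9)
    }
}
```

One character was written, but two chars come back, one per byte, and the line terminator is dropped.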

There is no method or constructor to set the character encoding in a RandomAccessFile. But it's still valuable to use readLine() because it takes care of parsing for a newline sequence and allocating memory.

The easiest solution in your situation is to manually convert the fake "line" into bytes by reversing what readLine() did, then decode the bytes into a real string with awareness of character encoding. I don't know how to write code in Groovy, so I'll give the answer in Java:

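// readLine() returns one char (0-255) per byte read, or null at end of file
String fakeLine = randomAccess.readLine();
// Cast each char back to the raw byte it came from
byte[] bytes = new byte[fakeLine.length()];
for (int i = 0; i < fakeLine.length(); i++)
    bytes[i] = (byte) fakeLine.charAt(i);
// Decode the recovered bytes with the file's real encoding
String realLine = new String(bytes, java.nio.charset.StandardCharsets.UTF_8);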
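For completeness, here is a self-contained sketch of the whole read loop. It swaps the manual byte loop for getBytes(StandardCharsets.ISO_8859_1), which should give the same bytes because ISO-8859-1 maps chars 0 to 255 onto identical byte values. That shortcut and the names Utf8RandomAccessLines, decodeUtf8, readAllLines and demo are my own assumptions for illustration, not part of the question.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class Utf8RandomAccessLines {
    // Undoes readLine()'s byte-to-char widening, then decodes the bytes as UTF-8.
    static String decodeUtf8(String rawLine) {
        return new String(rawLine.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    // Reads every line of the file through readLine() and decodes each one as UTF-8.
    static List<String> readAllLines(File file) {
        List<String> lines = new ArrayList<>();
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            String raw;
            while ((raw = raf.readLine()) != null) { // null marks end of file
                lines.add(decodeUtf8(raw));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    // Writes two accented sample lines to a temporary file and reads them back.
    static List<String> demo() {
        try {
            File tmp = File.createTempFile("utf8-lines", ".txt");
            try {
                String text = "\u00E9t\u00E9\r\nna\u00EFve\n";
                Files.write(tmp.toPath(), text.getBytes(StandardCharsets.UTF_8));
                return readAllLines(tmp);
            } finally {
                tmp.delete();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        List<String> expected = Arrays.asList("\u00E9t\u00E9", "na\u00EFve");
        System.out.println(demo().equals(expected)); // prints "true"
    }
}
```

Since Groovy can call the same Java APIs, the equivalent one-liner in the question's loop would presumably be new String(line.getBytes('ISO-8859-1'), 'UTF-8'), but I have not tested that in Groovy. File positions from getFilePointer() and seek() are byte offsets, so decoding the line afterwards does not affect them.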

Upvotes: 3
