Reputation: 961
I use a RandomAccessFile object to read a UTF-8 French file, via its readLine method.
My Groovy code is below:
while ((line = randomAccess.readLine())) {
def utfLine = new String(line.getBytes('UTF-8'), 'UTF-8')
++count
long nextRecordPos = randomAccess.getFilePointer()
compareNextRecords(utfLine, randomAccess)
randomAccess.seek(nextRecordPos)
}
My problem is that utfLine and line are the same: the accented characters stay like Ã© instead of é. No conversion is done.
Upvotes: 1
Views: 639
Reputation: 18532
First of all, this line of code does absolutely nothing: it encodes the string to UTF-8 bytes and immediately decodes those same bytes back, so the result equals the input. Remove it:
def utfLine = new String(line.getBytes('UTF-8'), 'UTF-8')
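To illustrate why that line is a no-op, here is a minimal Java sketch. The class and method names (RoundTripDemo, roundTrip) are made up for the illustration and do not come from the question's code.

```java
import java.nio.charset.StandardCharsets;

class RoundTripDemo {
    // Encodes a string to UTF-8 bytes, then decodes those same bytes as UTF-8.
    static String roundTrip(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The two chars readLine() produces for a UTF-8 encoded "é" (U+00C3, U+00A9).
        String misread = "\u00c3\u00a9";
        // The round trip hands back an equal string, so nothing is repaired.
        System.out.println(roundTrip(misread).equals(misread)); // prints true
    }
}
```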
According to the Javadoc, RandomAccessFile.readLine() is not aware of character encodings. It reads bytes until it encounters "\r", "\n", or "\r\n", or reaches the end of the file. Each byte becomes one char whose low eight bits are the byte's value and whose high eight bits are zero. ASCII bytes therefore come out as expected, but bytes between 128 and 255 are put into the string verbatim instead of being decoded as part of a multi-byte sequence (in effect, the file is read as if it were ISO-8859-1).
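The behavior described above can be observed with a small experiment. This is only a sketch: ReadLineDemo and rawFirstLine are hypothetical names, and the temp file stands in for the real French file.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class ReadLineDemo {
    // Writes the text to a temp file as UTF-8, then returns the first line
    // exactly as RandomAccessFile.readLine() reports it.
    static String rawFirstLine(String text) throws IOException {
        Path tmp = Files.createTempFile("readline-demo", ".txt");
        try {
            Files.write(tmp, text.getBytes(StandardCharsets.UTF_8));
            try (RandomAccessFile raf = new RandomAccessFile(tmp.toFile(), "r")) {
                return raf.readLine();
            }
        } finally {
            Files.deleteIfExists(tmp);
        }
    }

    public static void main(String[] args) throws IOException {
        // "été" is 3 characters but 5 bytes in UTF-8 (each "é" takes 2 bytes).
        String raw = rawFirstLine("\u00e9t\u00e9\n");
        System.out.println(raw.length()); // prints 5: one char per byte
        for (char c : raw.toCharArray()) {
            System.out.printf("U+%04X%n", (int) c);
        }
    }
}
```

Each "é" comes back as the two chars U+00C3 and U+00A9, which is the "Ã©" seen in the question.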
There is no method or constructor to set the character encoding in a RandomAccessFile. But it's still valuable to use readLine() because it takes care of parsing for a newline sequence and allocating memory.
The easiest solution in your situation is to manually convert the fake "line" back into bytes by reversing what readLine() did, then decode those bytes into a real string with awareness of the character encoding. I don't know how to write code in Groovy, so I'll give the answer in Java:
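// Note: readLine() returns null at end of file, so check for that in a loop.
String fakeLine = randomAccess.readLine();
// Undo readLine()'s widening: each char holds exactly one byte value.
byte[] bytes = new byte[fakeLine.length()];
for (int i = 0; i < fakeLine.length(); i++)
    bytes[i] = (byte) fakeLine.charAt(i);
// Decode the recovered bytes as UTF-8 (no checked exception with a Charset).
String realLine = new String(bytes, java.nio.charset.StandardCharsets.UTF_8);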
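The conversion above could also be wrapped in a small reusable helper. The sketch below is one possible shape, not the only one: Utf8Lines, decode, and readUtf8Line are hypothetical names, and it swaps the manual loop for getBytes(StandardCharsets.ISO_8859_1), which yields the same bytes because every char returned by readLine() is at most 0xFF.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

class Utf8Lines {
    // Re-decodes a string produced by RandomAccessFile.readLine() as UTF-8.
    // Returns null for null input, so the end-of-file signal passes through.
    static String decode(String fakeLine) {
        if (fakeLine == null) {
            return null;
        }
        // ISO-8859-1 maps chars 0..255 to the identical byte values,
        // which recovers the original bytes read from the file.
        byte[] bytes = fakeLine.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Reads the next line from the file and decodes it as UTF-8.
    static String readUtf8Line(RandomAccessFile file) throws IOException {
        return decode(file.readLine());
    }
}
```

Because the helper only touches the returned string, getFilePointer() and seek() keep working on byte offsets as before. From Groovy, the same JDK calls should be usable directly, for example new String(line.getBytes('ISO-8859-1'), 'UTF-8').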
Upvotes: 3