Java - get line from Random access file based on offsets

Question

I have a very large (11GB) .json file (yeah, whoever thought that was a great idea?) that I need to sample (read k random lines).

I'm not very savvy in Java file IO but I have, of course, found this post: How to get a random line of a text file in Java?

I'm ruling out the accepted answer because reading every single line of an 11GB file just to select one (or rather k) of the roughly 100k lines is clearly far too slow.

Fortunately, there is a second suggestion posted there that I think might be of better use to me:

1. Use RandomAccessFile to seek to a random byte position in the file.

2. Seek left and right to the next line terminator. Let L be the line between them.

3. With probability (MIN_LINE_LENGTH / L.length) return L. Otherwise, start over at step 1.
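
The last step is what makes the selection uniform, and it can be sketched as below. This is only an illustration under assumptions: the class RejectionStep, the method accept, and its parameters are hypothetical names that do not come from the linked post, and lengths are assumed to be measured in bytes with the line terminator included, so that no line has length zero.

```java
import java.util.Random;

class RejectionStep {
  // Hypothetical helper: decides whether to keep a line that was found
  // by seeking to a uniformly random byte offset in the file.
  // Both lengths are in bytes; minLineLength must not exceed any line's length.
  static boolean accept(long lineLength, long minLineLength, Random rnd) {
    if (minLineLength <= 0 || lineLength < minLineLength) {
      throw new IllegalArgumentException("need 0 < minLineLength <= lineLength");
    }
    // A random byte lands in this line with probability lineLength / fileSize.
    // Accepting with probability minLineLength / lineLength cancels that bias:
    // each attempt returns any given line with probability minLineLength / fileSize.
    return rnd.nextDouble() < (double) minLineLength / lineLength;
  }
}
```

Without this step, long lines would be picked more often than short ones, in proportion to their length.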

So far so good, but I was wondering about that "let L be the line between them".

I would have done something like this (untested):

RandomAccessFile raf = ...
long pos = ...
String line = getLine(raf,pos);
...

where

private String getLine(RandomAccessFile raf, long start) throws IOException{
    long pos = (start % 2 == 0) ? start : start -1;
    
    if(pos == 0) return raf.readLine();
    
    do{
        pos -= 2;
        raf.seek(pos);
    }while(pos > 0 && raf.readChar() != '\n');

    pos = (pos <= 0) ? 0 : pos + 2;
    raf.seek(pos);
    return raf.readLine();
}

and then work with line.length(), which removes the need to explicitly seek to the right end of the line.

So why "seek left and right to the next line terminator"? Is there a more convenient way to get the line from these two offsets?

Andy Turner · Accepted Answer

It looks like this would do approximately the same thing: raf.readLine() seeks right to the next line terminator; it just does it for you.

One thing to note is that RandomAccessFile.readLine() doesn't support reading Unicode strings from the file. From its documentation:

Each byte is converted into a character by taking the byte's value for the lower eight bits of the character and setting the high eight bits of the character to zero. This method does not, therefore, support the full Unicode character set.

Demo of the incorrect reading:

import java.io.*;
import java.nio.charset.StandardCharsets;

class Demo {
  public static void main(String[] args) throws IOException {
    try (FileOutputStream fos = new FileOutputStream("output.txt");
         OutputStreamWriter osw = new OutputStreamWriter(fos, StandardCharsets.UTF_8);
         BufferedWriter writer = new BufferedWriter(osw)) {
      writer.write("ⵉⵎⴰⵣⵉⵖⵏ");
    }

    try (RandomAccessFile raf = new RandomAccessFile("output.txt", "r")) {
      System.out.println(raf.readLine());
    }
  }
}

Output:

âµâµâ´°âµ£âµâµâµ

But output.txt does contain the correct data:

$ cat output.txt
ⵉⵎⴰⵣⵉⵖⵏ

As such, you might want to do the seeking yourself, or explicitly convert the result of raf.readLine() to the correct charset:

String line = new String(
    raf.readLine().getBytes(StandardCharsets.ISO_8859_1),      
    StandardCharsets.UTF_8);
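
If you go the "do the seeking yourself" route instead, a byte-wise version might look like the following sketch. It assumes the file is UTF-8 with '\n' line terminators (optionally preceded by '\r') and that the offset is within the file; LineAtOffset and lineAt are hypothetical names chosen for illustration. Scanning bytes rather than chars is safe here because the byte 0x0A never occurs inside a multi-byte UTF-8 sequence.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

class LineAtOffset {
  // Returns the line containing byte offset `start`, decoded as UTF-8.
  // Assumption: 0 <= start < raf.length(). A terminator byte counts as
  // part of the line it ends.
  static String lineAt(RandomAccessFile raf, long start) throws IOException {
    long pos = start;
    // Seek left: step back until the byte just before pos is '\n'
    // or pos reaches the beginning of the file.
    while (pos > 0) {
      raf.seek(pos - 1);
      if (raf.read() == '\n') break;
      pos--;
    }
    raf.seek(pos);
    // Seek right: collect raw bytes up to the next '\n' or end of file.
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    int b;
    while ((b = raf.read()) != -1 && b != '\n') {
      buf.write(b);
    }
    byte[] bytes = buf.toByteArray();
    int len = bytes.length;
    // Drop a trailing '\r' so Windows-style line endings are handled too.
    if (len > 0 && bytes[len - 1] == '\r') len--;
    // Decode the whole line at once so multi-byte characters stay intact.
    return new String(bytes, 0, len, StandardCharsets.UTF_8);
  }
}
```

Reading one byte per call is simple but slow on long lines; reading into a byte[] in chunks would likely be faster, at the cost of a little more bookkeeping.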
