pierrefevrier
pierrefevrier

Reputation: 1640

Search for a string in large file and save it's position in Java

I'm searching for a way to parse large files (about 5-10Go) and search for position (in byte) of some recurrent strings, the fastest as possible.

I've tried to use the RandomAccessFile reader by doing something like bellow:

RandomAccessFile lecteurFichier = new RandomAccessFile(<MyFile>, "r");
while (currentPointeurPosition < lecteurFichier.length()) {
     char currentFileChar = (char) lecteurFichier.readByte();
     // Test each char for matching my string (by appending chars until I found my string)
     // and keep a trace of all found string's position
}

The problem is this code is too slow (maybe because I read byte by byte ?).

I also tried the solution bellow, which is perfect in term of speedness but I can't get my string's positions.

    FileInputStream is = new FileInputStream(fichier.getFile());

    FileChannel f = is.getChannel();

    ByteBuffer buf = ByteBuffer.allocateDirect(64 * 1024);

    Charset charset = Charset.forName("ISO-8859-1");
    CharsetDecoder decoder = charset.newDecoder();

    long len = 0;
    while ((len = f.read(buf)) != -1) {
        buf.flip();

        String data = "";
        try {
            int old_position = buf.position();
            data = decoder.decode(buf).toString();
            // reset buffer's position to its original so it is not altered:
            buf.position(old_position);
        }
        catch (Exception e) {
            e.printStackTrace();
        }

        buf.clear();
    }

    f.close();

Does anyone has a better solution to propose ?

Thank you in advance (and sorry for my spelling, I'm french)

Upvotes: 2

Views: 3974

Answers (2)

Stephen C
Stephen C

Reputation: 718788

Since your input data is encoded in an 8-bit encoding*, you can speed up the search by encoding the search string rather than decoding the file:

byte[] encoded = searchString.getBytes("ISO-8859-1");

BufferedInputStream bis = new BufferedInputStream(new FileInputStream(file));
int b;
long pos = -1;
while ((b = bis.read()) != -1) {
    pos++;
    if (encoded[0] == b) {
       // see if rest of string matches
    }
}

A BufferedInputStream should be pretty fast. Using ByteBuffer might be faster, but this is going to make the search logic more complicated because of the possibility of a string match than spans a buffer boundary.

Then there are various clever ways to optimize string searches that could be adapted to this situation ... where you are search a stream of bytes / characters rather than an array of bytes / characters. The Wikipedia page on String Searching is a good place to start.

Note that since we are reading and matching in a byte-wise fashion, the position is just the count of bytes read (or skipped), so there is no need to use a random access file.


* In fact this trick will work with many multibyte encodings too.

Upvotes: 1

Arvind
Arvind

Reputation: 478

Searching for a 'needle' in a 'haystack' is a well-studied problem-Here's a related link on StackOverflow itself. I am sure the java implementations of the algorithms discussed should be available too. Why not try some of them,to see if they fit the job?

Upvotes: 0

Related Questions