turbanoff

Reputation: 2479

Searching for 4-5 byte sequences in a big file

I have a file of about 1.5 GB, and I need to search it for 3 billion byte sequences. Each sequence is 4 or 5 bytes long. For each one I need to find the first position where it occurs, or confirm that it does not occur in the file at all. What is the fastest way to do this?

The machine is limited to 4 GB of RAM.

Upvotes: 0

Views: 697

Answers (3)

Mark

Reputation: 1

Check out the Searchlight search engine.

This program allows multiple sequences of up to 10 ASCII bytes to be stored within a single file. You then point it at a file, directory, file of filenames, file of directory names, arraylist of filenames or an arraylist of directory names and away it goes!!

Furthermore, it reports the file byte position/offset of each sequence found.

Upvotes: 0

st0le

Reputation: 33545

Use preprocessing.

I think you should just build an index: make one pass through the file, recording the first occurrence of every unique 4-byte sequence. Store each 4-byte sequence and its first position in a separate file, sorted by byte sequence.

A simple binary search on the index file will then find each sequence efficiently.

You could be more clever and use hashing to reduce the search to O(1).
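
To make the indexing idea concrete, here is a minimal in-memory sketch in Python. The names `build_index` and `lookup` are made up for illustration, and the sketch assumes the data fits in memory; for a 1.5 GB file the index would have to be built in chunks and written to disk as described above. It also handles only 4-byte patterns: because the index keeps just the first position of each 4-byte window, a 5-byte pattern would need an extra check beyond what is shown here.

```python
import bisect

def build_index(data: bytes, width: int = 4):
    """Map every distinct `width`-byte window to its first position.

    Returns two parallel lists: the sorted windows and their first positions.
    """
    first = {}
    for pos in range(len(data) - width + 1):
        key = data[pos:pos + width]
        # setdefault keeps the earliest position seen for this window
        first.setdefault(key, pos)
    keys = sorted(first)
    positions = [first[k] for k in keys]
    return keys, positions

def lookup(keys, positions, pattern: bytes) -> int:
    """Binary-search the sorted index; return first position or -1."""
    i = bisect.bisect_left(keys, pattern)
    if i < len(keys) and keys[i] == pattern:
        return positions[i]
    return -1

data = b"abcdXabcdYzzzz"
keys, positions = build_index(data)
print(lookup(keys, positions, b"abcd"))  # 0
print(lookup(keys, positions, b"bcdY"))  # 6
print(lookup(keys, positions, b"nope"))  # -1
```

The intermediate dictionary is already the hashed O(1) variant; the sorted lists are what you would write to the index file for binary search.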

Upvotes: 0

deceze

Reputation: 522332

Use grep. It's highly optimized for finding things in large files.
If that's not an option, read about the Boyer-Moore algorithm it uses and implement it yourself. It'll take a lot of tweaking to reproduce the speed grep has, though.
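
For a feel of what implementing it yourself involves, here is a rough sketch of Boyer-Moore-Horspool, a simplified variant of Boyer-Moore that uses only the bad-character shift table. This is not grep's actual implementation, the function name `horspool_find` is made up for illustration, and pure Python will be far slower than grep; it only shows the skipping idea.

```python
def horspool_find(haystack: bytes, needle: bytes) -> int:
    """Return the first index of `needle` in `haystack`, or -1 if absent."""
    m, n = len(needle), len(haystack)
    if m == 0:
        return 0
    if m > n:
        return -1
    # For each byte value, how far the window may jump when that byte is
    # aligned with the last position of the needle. Bytes that do not occur
    # in needle[:-1] allow a jump of the full needle length.
    shift = [m] * 256
    for i in range(m - 1):
        shift[needle[i]] = m - 1 - i
    pos = 0
    while pos <= n - m:
        if haystack[pos:pos + m] == needle:
            return pos
        # Jump according to the byte under the last position of the window
        pos += shift[haystack[pos + m - 1]]
    return -1

print(horspool_find(b"hello world", b"world"))  # 6
print(horspool_find(b"hello world", b"xyz"))    # -1
```

With 4-5 byte needles the maximum jump is only 4-5 bytes per step, so the gain over a naive scan is modest, and each of the 3 billion patterns would still need its own pass over the file.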

Upvotes: 1
