Reputation: 2479
I have a file of about 1.5 GB, and I need to search it for 3 billion byte sequences. Each sequence is 4 or 5 bytes long. For each one I need either the position of its first occurrence or confirmation that it does not occur in the file. What is the fastest way to do this?
The machine has a RAM limit of 4 GB.
Upvotes: 0
Views: 697
Reputation: 1
Check out the Searchlight search engine.
This program lets you store multiple sequences of up to 10 ASCII bytes in a single file. You then point it at a file, a directory, a file of filenames, a file of directory names, an arraylist of filenames, or an arraylist of directory names, and away it goes.
Furthermore, it reports the file byte position/offset of each sequence found.
Upvotes: 0
Reputation: 33545
Use preprocessing.
I think you should just create an index: make one pass through the file, recording the first occurrence of every unique 4-byte sequence. Store each 4-byte sequence and the position of its first occurrence in a separate file, sorted by byte sequence.
A simple binary search on the index file will then find your sequence efficiently.
You could be more clever and use hashing to reduce the search to O(1).
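The indexing idea above could be sketched roughly as follows. This is only an illustration under several assumptions: the index is kept as an in-memory sorted list rather than a file on disk, keys are read as big-endian 32-bit integers so that numeric order matches byte order, and the names `build_index` and `lookup` are made up for this example. It handles 4-byte patterns only, and a Python dict of this kind would not fit in 4 GB for a 1.5 GB input, so a real version would need a more compact on-disk layout.

```python
import bisect
import struct


def build_index(data: bytes):
    """Return a sorted list of (key, first_position) for every 4-byte window."""
    first = {}
    for pos in range(len(data) - 3):
        # Interpret the 4-byte window as a big-endian unsigned integer.
        key = struct.unpack_from(">I", data, pos)[0]
        # setdefault keeps the earliest position seen for each key.
        first.setdefault(key, pos)
    return sorted(first.items())


def lookup(index, pattern: bytes) -> int:
    """Binary-search the index; return the first position, or -1 if absent."""
    key = struct.unpack(">I", pattern)[0]
    # (key, -1) sorts before any real entry with the same key.
    i = bisect.bisect_left(index, (key, -1))
    if i < len(index) and index[i][0] == key:
        return index[i][1]
    return -1


if __name__ == "__main__":
    sample = b"abcdabcdxyz"
    idx = build_index(sample)
    print(lookup(idx, b"abcd"))
    print(lookup(idx, b"zzzz"))
```

Building the index costs one pass over the file, after which each of the 4-byte queries is a single binary search instead of a scan.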
Upvotes: 0
Reputation: 522332
Use grep. It's highly optimized for finding things in large files.
If that's not an option, read about the Boyer-Moore algorithm it uses and implement it yourself. It'll take a lot of tweaking to reproduce the same speed grep has, though.
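For a feel of how this family of algorithms works, here is a minimal sketch of the Horspool simplification of Boyer-Moore (bad-character shifts only, without the good-suffix rule). It is not grep's actual implementation, and the function name `horspool_find` is hypothetical. It searches for a single pattern in a bytes object already held in memory.

```python
def horspool_find(data: bytes, pattern: bytes) -> int:
    """Return the index of the first occurrence of pattern in data, or -1."""
    m, n = len(pattern), len(data)
    if m == 0:
        return 0
    # Shift table: for each byte value, how far the window may slide when
    # that byte is aligned with the last position of the pattern.
    shift = [m] * 256
    for i in range(m - 1):
        shift[pattern[i]] = m - 1 - i
    pos = 0
    while pos <= n - m:
        if data[pos:pos + m] == pattern:
            return pos
        # Slide by the shift for the byte under the window's last position.
        pos += shift[data[pos + m - 1]]
    return -1


if __name__ == "__main__":
    print(horspool_find(b"hello world", b"world"))
    print(horspool_find(b"hello world", b"xyz"))
```

With patterns of only 4 or 5 bytes the maximum skip is small, so the gain over a plain scan is limited, and the search still has to be repeated for every pattern.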
Upvotes: 1