Reputation: 2348
I need a "point of departure" to research options for highly efficient search algorithms, methods and techniques for finding random strings within a massive amount of random data. I'm just learning about this stuff, so anyone have experience with this? Here's some conditions I want to optimize for:
Any hints, clues, techniques, wiki articles etc. would be greatly appreciated! I'm just studying this now, and it seems interesting. Thanks.
Upvotes: 1
Views: 319
Reputation: 171218
A simple way to do this is to build an index on all possible N-byte substrings of the searchable data (with N = 4 or 8 or something like that). The index would map from the small chunk to all locations where that chunk occurs.
When you want to look up a value, take its first N bytes and use them to find all candidate locations. You then need to verify each candidate location, of course.
A higher value of N means more index space but faster lookups, because fewer false positives will be found.
Such an index is likely to be a small multiple of the base data in size.
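A minimal sketch of this first approach in Python, assuming the searchable data fits in memory as a bytes object (for truly massive data the index would live on disk or in a key-value store); build_ngram_index and find_all are illustrative names, not anything standard:

    from collections import defaultdict

    def build_ngram_index(data: bytes, n: int = 4) -> dict:
        """Map every n-byte substring of data to the list of offsets where it occurs."""
        index = defaultdict(list)
        for i in range(len(data) - n + 1):
            index[data[i:i + n]].append(i)
        return index

    def find_all(data: bytes, index: dict, needle: bytes, n: int = 4) -> list:
        """Use the first n bytes of the needle to narrow candidates, then verify each."""
        candidates = index.get(needle[:n], [])
        return [pos for pos in candidates
                if data[pos:pos + len(needle)] == needle]

Usage would be something like find_all(data, build_ngram_index(data), b"\x13\x37\xca\xfe\x00"), which returns every offset where that 5-byte string occurs.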
A second way would be to split the searchable data into contiguous, non-overlapping chunks of N bytes (N = 64 or so). Hash each chunk down to a smaller size M (M = 4 or 8 or so).
This saves a lot of index space because you don't need all the overlapping chunks.
When you look up a value, you can locate candidate matches by hashing all overlapping N-byte substrings of the string to be found and looking those hashes up in the index. This assumes the string to be found is at least N * 2 bytes in size, so that any occurrence is guaranteed to fully cover at least one of the indexed chunks.
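A sketch of this second approach along the same lines, again assuming in-memory data; truncated BLAKE2b is used here as the m-byte chunk hash, which is just one possible choice:

    import hashlib
    from collections import defaultdict

    def build_chunk_index(data: bytes, n: int = 64, m: int = 8) -> dict:
        """Hash each non-overlapping n-byte chunk down to m bytes and record its offset."""
        index = defaultdict(list)
        for off in range(0, len(data) - n + 1, n):
            h = hashlib.blake2b(data[off:off + n], digest_size=m).digest()
            index[h].append(off)
        return index

    def find_all(data: bytes, index: dict, needle: bytes, n: int = 64, m: int = 8) -> list:
        """Slide an n-byte window over the needle; a window that lines up exactly with an
        indexed chunk hashes to the same value. Verify every candidate alignment."""
        hits = set()
        for i in range(len(needle) - n + 1):
            h = hashlib.blake2b(needle[i:i + n], digest_size=m).digest()
            for chunk_off in index.get(h, []):
                start = chunk_off - i  # where the needle would begin in data
                if start >= 0 and data[start:start + len(needle)] == needle:
                    hits.add(start)
        return sorted(hits)

Because a sufficiently long occurrence must fully contain at least one aligned chunk, one of the needle's sliding windows will hash to an indexed value, and the final verification step confirms or rejects the match.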
Upvotes: 2