Jamison

Reputation: 2348

What are some algorithms or methods for highly efficient search indexes over random data?

I need a "point of departure" to research options for highly efficient search algorithms, methods and techniques for finding random strings within a massive amount of random data. I'm just learning about this stuff, so anyone have experience with this? Here's some conditions I want to optimize for:

  1. The first goal is to minimize file size for search indexes and the like - so the smallest possible index, or even better, searching on the fly.
  2. The data to be searched is a large amount of entirely random data - say, random binary 0s and 1s with no perceptible pattern. Gigabytes of the stuff.
  3. Presented with an equally random search string, say 0111010100000101010101, what is the most efficient way to locate that same string within a mountain of random data? What are the trade-offs in performance, etc.?
  4. All instances of that search string need to be located, which seems like an important condition that limits the types of solutions that can be used.

Any hints, clues, techniques, wiki articles etc. would be greatly appreciated! I'm just studying this now, and it seems interesting. Thanks.

Upvotes: 1

Views: 319

Answers (1)

usr

Reputation: 171218

A simple way to do this is to build an index on all possible N-byte substrings of the searchable data (with N = 4 or 8 or something like that). The index would map from the small chunk to all locations where that chunk occurs.

When you want to look up a value, take its first N bytes and use them to find all candidate locations. You then need to verify each location, of course.

A higher value for N means more index space usage but faster lookups, because fewer false positives will be found.

Such an index is likely to be a small multiple of the base data in size.
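
Here is a minimal sketch of that idea in Python (the chunk size N, the helper names, and the toy data are illustrative assumptions, not a tuned implementation):

    # Illustrative only: index every overlapping N-byte substring of the data.
    from collections import defaultdict

    def build_substring_index(data: bytes, n: int = 4) -> dict:
        """Map each N-byte substring to every offset where it starts."""
        index = defaultdict(list)
        for i in range(len(data) - n + 1):
            index[data[i:i + n]].append(i)
        return index

    def find_all(data: bytes, index: dict, needle: bytes, n: int = 4) -> list:
        """Find every occurrence of needle (assumes len(needle) >= n)."""
        candidates = index.get(needle[:n], [])
        # Only the first n bytes were matched via the index, so verify the rest.
        return [i for i in candidates if data[i:i + len(needle)] == needle]

    data = bytes.fromhex("deadbeef") * 3 + bytes.fromhex("cafebabe")
    idx = build_substring_index(data, n=4)
    print(find_all(data, idx, bytes.fromhex("beefdead"), n=4))  # -> [2, 6]

Note how every entry in the candidate list still has to be verified against the underlying data, because only the first N bytes of the needle were matched through the index.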


A second way would be to split the searchable data into contiguous, non-overlapping chunks of N bytes (N = 64 or so). Hash each chunk down to a smaller size of M bytes (M = 4 or 8 or so).

This saves a lot of index space because you don't need all the overlapping chunks.

When you look up a value, you can locate candidate matches by hashing all overlapping N-byte substrings of the string to be found and checking each hash against the index, since you don't know how a match aligns with the chunk boundaries. This assumes that the string to be found is at least N * 2 bytes in size, so it is guaranteed to contain at least one complete aligned chunk.
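
A minimal sketch of this chunked/hashed variant, again in Python (N, M, the choice of blake2b as the hash, and the helper names are illustrative assumptions):

    # Illustrative only: index aligned, non-overlapping N-byte chunks by an
    # M-byte hash of each chunk.
    import hashlib
    from collections import defaultdict

    N = 64  # chunk size
    M = 8   # bytes of hash stored per chunk

    def chunk_hash(chunk: bytes) -> bytes:
        return hashlib.blake2b(chunk, digest_size=M).digest()

    def build_chunk_index(data: bytes) -> dict:
        """Map the M-byte hash of each aligned N-byte chunk to its offsets."""
        index = defaultdict(list)
        for off in range(0, len(data) - N + 1, N):
            index[chunk_hash(data[off:off + N])].append(off)
        return index

    def find_all(data: bytes, index: dict, needle: bytes) -> set:
        """Find occurrences of needle; assumes len(needle) >= 2 * N so that
        any match must fully contain at least one aligned chunk."""
        hits = set()
        # The alignment of a match is unknown, so hash the N-byte window at
        # every offset inside the needle and look it up in the index.
        for i in range(len(needle) - N + 1):
            for chunk_off in index.get(chunk_hash(needle[i:i + N]), []):
                start = chunk_off - i  # where the full needle would begin
                if start >= 0 and data[start:start + len(needle)] == needle:
                    hits.add(start)
        return hits

    data = bytes(range(256)) * 4        # stand-in for the searchable data
    needle = data[100:100 + 2 * N]      # any query at least 2 * N bytes long
    print(sorted(find_all(data, build_chunk_index(data), needle)))
    # -> [100, 356, 612, 868]

The index here stores only one entry per N bytes of data instead of one per byte, which is where the space saving comes from; the price is the minimum query length and the extra hashing work at lookup time.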

Upvotes: 2
