Avenger

Reputation: 441

How many times a string appears in another string

I have a large static binary (10GB) that doesn't change.

I want to be able to take as input small strings (15 bytes or fewer each) and then determine which of them is the least frequent.

I understand that without actually searching the whole binary I won't be able to determine this exactly, so I know it will be an approximation.

Building a tree/hash table isn't feasible since it would require about 256^15 bytes, which is a LOT.

I have about 100GB of disk space and 8GB of RAM which will be dedicated to this task, but I can't seem to find any way to accomplish it without actually going over the file.

I have as much time as I want to prepare the big binary, and after that I'll need to decide which is the least frequent string many many times.

Any ideas?

Thanks! Daniel.

(BTW: if it matters, I'm using Python)

Upvotes: 4

Views: 222

Answers (4)

Nuclearman

Reputation: 5304

Since you are looking for which string is least frequent, and are willing to accept an approximate solution, you could use a series of Bloom filters instead of a hash table. If you use sufficiently large ones, you shouldn't need to worry about the query size, as you can probably keep the false positive rate low.

The idea would be to go through all of the possible query sizes and make sub-strings of the binary at each of them. For example, if the queries will be between 3 and 100 bytes, the preprocessing would cost roughly (N * (sum of i from i = 3 to i = 100)) sub-strings. Then add the sub-strings one by one to the first Bloom filter that does not already contain them, creating a new Bloom filter with the same hash functions when needed. Each query then simply goes through each of the filters and checks whether it is present; every filter that contains it adds 1 to the count, so the number of filters holding the string approximates its frequency.
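A minimal sketch of that counting scheme in Python, using a toy Bloom filter built on hashlib. The filter size, number of hash functions, the 3-15 byte length range (matching the question's 15-byte limit rather than the 3-100 example) and reading the whole binary into memory are illustrative assumptions, not tuned values:

    import hashlib

    class BloomFilter:
        def __init__(self, size_bits=8 * 1024 * 1024, num_hashes=4):
            self.size = size_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(size_bits // 8)

        def _positions(self, item):
            # derive num_hashes bit positions from salted blake2b digests
            for i in range(self.num_hashes):
                h = hashlib.blake2b(item, salt=i.to_bytes(4, 'little')).digest()
                yield int.from_bytes(h[:8], 'little') % self.size

        def add(self, item):
            for pos in self._positions(item):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def __contains__(self, item):
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(item))

    def build_filters(path, min_len=3, max_len=15):
        """Add every sub-string to the first filter that doesn't already hold it."""
        filters = [BloomFilter()]
        with open(path, 'rb') as f:
            data = f.read()          # illustrative only: stream in chunks for a 10GB file
        for length in range(min_len, max_len + 1):
            for i in range(len(data) - length + 1):
                sub = data[i:i + length]
                for bf in filters:
                    if sub not in bf:
                        bf.add(sub)
                        break
                else:                # present in every filter so far: grow the series
                    filters.append(BloomFilter())
                    filters[-1].add(sub)
        return filters

    def approx_count(filters, query):
        """Approximate frequency = number of filters that claim to contain the query."""
        return sum(1 for bf in filters if query in bf)

The least frequent query is then min(queries, key=lambda q: approx_count(filters, q)), with the caveat that false positives can only inflate the counts.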

You'll need to balance the false positive rate against the number of filters. If the false positive rate gets too high on one of the filters it isn't useful; likewise, it's bad if you end up with trillions of Bloom filters (quite possible if you use one filter per sub-string). There are a couple of ways these issues can be dealt with.

  • To reduce the number of filters:
    1. Randomly delete filters until there are only so many left. This will likely increase the false negative rate, which probably means it's better to simply delete the filters with the highest expected false positive rates.
    2. Randomly merge filters until there are only so many left. Ideally avoiding merging a filter too often as it increases the false positive rate. Practically speaking, you probably have too many to do this without making use of the scalable version (see below), as it'll probably be hard enough to manage the false positive rate.
    3. It also may not be a bad idea to avoid a greedy approach when adding to a Bloom filter; be selective about which filter something is added to.

You might end up having to implement scalable Bloom filters to keep things manageable, which sounds similar to what I'm suggesting anyway, so it should work well.

Upvotes: 0

mcdowella

Reputation: 19601

If you know all of the queries in advance, or are prepared to batch them up, another approach would be to build an Aho-Corasick automaton (http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm) from them. This takes time linear in the total size of the queries. Then you can stream the 10GB data past it in time proportional to the size of that data plus the number of matches found.
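A sketch of that batched approach, assuming the third-party pyahocorasick package and queries supplied as latin-1-decoded strings so that each byte of the binary maps to exactly one character; the chunk size and the overlap bookkeeping are illustrative choices:

    import ahocorasick

    def count_batch(queries, path, chunk_size=1 << 20):
        automaton = ahocorasick.Automaton()
        for q in queries:                    # queries as latin-1 decoded str
            automaton.add_word(q, q)
        automaton.make_automaton()

        counts = {q: 0 for q in queries}
        overlap = max(len(q) for q in queries) - 1
        tail = ''
        with open(path, 'rb') as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                text = tail + chunk.decode('latin-1')
                prev_len = len(tail)
                for end, q in automaton.iter(text):
                    # matches that end inside the carried-over tail were
                    # already counted in the previous iteration
                    if end >= prev_len:
                        counts[q] += 1
                tail = text[-overlap:] if overlap else ''
        return counts

The least frequent string in the batch is then min(counts, key=counts.get).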

Upvotes: 0

mcdowella

Reputation: 19601

Because you have a large static string that does not change, you can separate the one-time work of preprocessing this string, which never has to be repeated, from the work of answering queries. It might be convenient to do the one-time work on a more powerful machine.

If you can find a machine with an order of magnitude or so more internal storage, you could build a suffix array - an array of offsets into the data, sorted by the suffixes starting at those offsets. This could be stored in external storage for queries, and you could use it with binary search to find the first and last positions in sorted order where your query string appears. The distance between the two gives you the number of occurrences, and a binary search needs about 34 probes to cover 16GB (taking 16GB as 2^34 bytes), so each query (two binary searches) should cost about 68 disk seeks.

It may not be reasonable to expect you to find that amount of internal storage, but I just bought a 1TB USB hard drive for about 50 pounds, so I think you could increase external storage for the one-time work. There are algorithms for suffix array construction in external memory, but because your query strings are limited to 15 bytes you don't need anything that complicated. Just create 200GB of data by writing out the 15-byte string found at every offset followed by a 5-byte offset number, then sort these 20-byte records with an external sort. Once they are sorted you can drop the 15-byte keys, leaving 50GB of offsets into the string in sorted order for you to put into external storage and answer queries with.
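A sketch of the query side, assuming the one-time step left a file of 5-byte big-endian offsets in suffix-sorted order (called suffix.idx here, a made-up name) alongside the original binary; the record size, byte order and use of mmap for the comparisons are illustrative assumptions:

    import mmap, os

    OFF = 5                      # bytes per stored offset

    def _prefix(data, idx, i, length):
        """First `length` bytes of the suffix pointed to by index record i."""
        idx.seek(i * OFF)
        pos = int.from_bytes(idx.read(OFF), 'big')
        return data[pos:pos + length]

    def _search(data, idx, n, query, strict):
        """First index whose suffix prefix is >= query (or > query if strict)."""
        lo, hi = 0, n
        while lo < hi:
            mid = (lo + hi) // 2
            prefix = _prefix(data, idx, mid, len(query))
            if prefix < query or (strict and prefix == query):
                lo = mid + 1
            else:
                hi = mid
        return lo

    def count_occurrences(binary_path, index_path, query):
        """Exact number of occurrences of `query` (bytes, up to 15) in the binary."""
        n = os.path.getsize(index_path) // OFF
        with open(binary_path, 'rb') as b, open(index_path, 'rb') as idx:
            data = mmap.mmap(b.fileno(), 0, access=mmap.ACCESS_READ)
            first = _search(data, idx, n, query, strict=False)
            last = _search(data, idx, n, query, strict=True)
            return last - first

    # count_occurrences('big.bin', 'suffix.idx', b'needle')

Note that each probe costs a seek into the index plus a read from the original binary (often served from the OS page cache), so in the worst case it is somewhat more than the 68-seek estimate above.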

Upvotes: 0

Thilo

Reputation: 262534

Maybe build a hashtable with the counts for as many n-tuples as you can afford storage for? You can prune the entries that turn out not to appear at all. I wouldn't call it an "approximation"; it is rather an "upper bound", with the assurance that strings which don't appear at all are detected as such.

So, say you can build all 4-tuples.

Then to count occurrences of "ABCDEF" you'd take the minimum of count(ABCD), count(BCDE) and count(CDEF). If any of those is zero, the string is guaranteed not to appear. If the minimum is one, the string appears at most once (but maybe not at all).
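A sketch of that upper-bound table with n = 4, built in one streaming pass over the binary; the chunk size and the use of collections.Counter are illustrative choices (with 8GB of RAM a flat array indexed by the 4-byte value, or a smaller n, may be a better fit):

    from collections import Counter

    N = 4                                    # tuple size from the example above

    def build_counts(path, chunk_size=1 << 20):
        counts = Counter()
        tail = b''
        with open(path, 'rb') as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                data = tail + chunk
                counts.update(data[i:i + N] for i in range(len(data) - N + 1))
                tail = data[-(N - 1):]       # keep overlap so boundary tuples count once
        return counts

    def upper_bound(counts, query):
        """Upper bound on occurrences of `query` (at least N bytes); 0 means it cannot appear."""
        return min(counts[query[i:i + N]] for i in range(len(query) - N + 1))

For b'ABCDEF' this evaluates exactly min(count(ABCD), count(BCDE), count(CDEF)) as described above.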

Upvotes: 1
