Reputation: 441
I have a large static binary (10GB) that doesn't change.
I want to be able to take as input small strings (15 bytes or lower each) and then to determine which string is the least frequent.
I understand that without actually searching the whole binary I wont be able to determine this exactly, so I know it will be an approximation.
Building a tree/hash table isn't feasible since it will require about 256^15 bytes which is ALOT.
I have about 100GB of disk space and 8GB RAM which will be dedicated into this task, but I can't seem to find any way to accomplish this task without actually going over the file.
I have as much time as I want to prepare the big binary, and after that I'll need to decide which is the least frequent string many many times.
Any ideas?
Thanks! Daniel.
(BTW: if it matters, I'm using Python)
Upvotes: 4
Views: 222
Reputation: 5304
Since you are looking for which is least frequent, and are willing to accept approximate solution. You could use a series of Bloom filters instead of a hash table. If you use sufficiently large ones, you shouldn't need to worry about the query size, as you can probably keep the false positive rate low.
The idea would be to go through all of the possible query sizes and make sub-strings out of them. For example, if the queries will be between 3 and 100, then it would cost (N * (sum of (i) from i = 3 to i = 100)). Then one by one add the subsets to one of the bloom filters, such that the query doesn't exist within the filter, creating a new one Bloom filter with the same hash functions if needed. You obtain the count by going through each filter and checking if the query exists within it. Each query then simply goes through each of the filter and checks if it's there, if it is, it adds 1 to a count.
You'll need to try to balance the false positive rate as well as the number of filters. If the false positive rate gets too high on one of the filters it isn't useful, likewise it's bad if you have trillions of bloom filters (quite possible if you one filter per sub-string). There are a couple of ways these issues can be dealt with.
You might end up having to implement scalable bloom filters to keep things manageable, which sounds similar to what I'm suggesting anyway, so should work well.
Upvotes: 0
Reputation: 19601
If you know all of the queries in advance, or are prepared to batch them up, another approach would be to build an http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm tree from them. This takes time linear in the total size of the queries. Then you can stream the 10GB data past them in time proportional to the sum of the size of that data and the number of times any string finds a match.
Upvotes: 0
Reputation: 19601
Because you have a large static string that does not change you could distinguish one-time work preprocessing this string which never has to be repeated from the work of answering queries. It might be convenient to do the one-time work on a more powerful machine.
If you can find a machine with an order of magnitude or so more internal storage you could build a suffix array - an array of offsets into the stream in sorted order of the suffixes starting at the offset. This could be stored in external storage for queries, and you could use this with binary search to find the first and last positions in sorted order where your query string appears. Obviously the distance between the two will give you the number of occurrences, and a binary search will need about 34 binary chops to do 16 Gbyte assuming 16Gbytes is 2^34 bytes so each query should cost about 68 disk seeks.
It may not be reasonable to expect you to find that amount of internal storage, but I just bought a 1TB USB hard drive for about 50 pounds, so I think you could increase external storage for one time work. There are algorithms for suffix array construction in external memory, but because your query strings are limited to 15 bytes you don't need anything that complicated. Just create 200GB of data by writing out the 15-byte string found at every offset followed by an 5-byte offset number, then sort these 20-byte records with an external sort. This will give you 50Gbytes of indexes into the string in sorted order for you to put into external storage to answer queries with.
Upvotes: 0
Reputation: 262534
Maybe build a hashtable with the counts for as many n-tuples as you can afford storage for? You can prune the trees that don't appear anymore. I wouldn't call it "approximation", but could be "upper bounds", with assurance to detect strings that don't appear.
So, say you can build all 4-tuples.
Then to count occurrences for "ABCDEF" you'd have the minimum of count(ABCD), count(BCDE), count(CDEF). If that is zero for any of those, the string is guaranteed to not appear. If it is one, it will appear at most once (but maybe not at all).
Upvotes: 1