duhaime

Reputation: 27612

Redis: Identifying keys that occur more than once with minimal RAM usage

I'm working on an application that needs to analyze ~one billion 250-byte keys and identify the subset of those keys that occur more than once in the dataset.

The catch is that not all keys fit in main memory at once, so I'm wondering: is there an efficient algorithm or probabilistic data structure that can identify the keys that are likely to occur more than once?

My current plan is to use a kind of modified Bloom filter: I hash each key, then use that hash as a Redis key pointing to an integer counter. The first time a hash is seen, its value is set to 1; each subsequent sighting increments it. At the end, the keys whose hash counters are > 1 are the likely duplicates. Is there a better way to identify keys that occur more than once? I'd be very grateful for any suggestions others can offer!
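A minimal in-process sketch of that counting idea, using a fixed array of saturating byte counters indexed by each key's hash in place of Redis (the counter-array size and the hash choice here are illustrative assumptions, not part of the question):

```python
import hashlib

NUM_COUNTERS = 1 << 20  # illustrative size; scale up for a billion keys
counters = bytearray(NUM_COUNTERS)  # one saturating 8-bit counter per slot

def slot(key: bytes) -> int:
    # Hash the key and map the hash to a counter slot.
    return int.from_bytes(hashlib.sha1(key).digest()[:8], "big") % NUM_COUNTERS

def add(key: bytes) -> None:
    i = slot(key)
    if counters[i] < 255:  # saturate instead of wrapping around
        counters[i] += 1

def probably_repeated(key: bytes) -> bool:
    # Always True for a key seen twice or more; hash collisions can also
    # flag a unique key as repeated (false positives, never false negatives).
    return counters[slot(key)] > 1

for k in [b"alpha", b"beta", b"alpha", b"gamma", b"beta", b"alpha"]:
    add(k)
```

As with a counting Bloom filter, a second pass over the data is still needed to confirm the flagged keys, since distinct keys can share a counter slot.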

Upvotes: 3

Views: 145

Answers (1)

LSerni

Reputation: 57418

I'd try a brute-force option. Read the whole set and separate it into 65536 different files based on the first two bytes of each key (if the keys are sufficiently random) or of its hash (if not). You can actually use more than two bytes.

So key 0a18abad1dea... goes into file ./0a/18/0a18.dat. The whole operation requires approximately another 250 gigabytes of disk space.
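The bucket layout can be sketched like this (the filename follows the ./0a/18/0a18.dat pattern above; the choice of SHA-1 for non-random keys is an assumption):

```python
import hashlib

def bucket_path(key: bytes, key_is_random: bool = True) -> str:
    # Use the key's own leading bytes if they are well distributed;
    # otherwise hash the key first and use the hash's leading bytes.
    prefix = key[:2] if key_is_random else hashlib.sha1(key).digest()[:2]
    hi, lo = f"{prefix[0]:02x}", f"{prefix[1]:02x}"
    return f"./{hi}/{lo}/{hi}{lo}.dat"
```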

To optimize file opens/writes, you may want to keep 65536 in-memory buckets of keys and flush them periodically, rather than doing a file open/append/close for each new key. Each gigabyte of RAM allows around 50 extra 250-byte keys per bucket.
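The buffering could be sketched as follows (the flush threshold follows the per-gigabyte estimate above; the function names and file handling are assumptions):

```python
import os
from collections import defaultdict

FLUSH_THRESHOLD = 50  # keys buffered per bucket before touching disk

buffers = defaultdict(list)

def write_key(key: bytes, out_dir: str) -> None:
    # Buffer the key under its two-byte bucket prefix; flush in batches.
    bucket = key[:2].hex()  # e.g. "0a18"
    buffers[bucket].append(key)
    if len(buffers[bucket]) >= FLUSH_THRESHOLD:
        flush(bucket, out_dir)

def flush(bucket: str, out_dir: str) -> None:
    # One open/append/close per FLUSH_THRESHOLD keys instead of per key.
    path = os.path.join(out_dir, bucket[:2], bucket[2:], f"{bucket}.dat")
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "ab") as f:
        for k in buffers[bucket]:
            f.write(k)
    buffers[bucket].clear()
```

Any buckets still non-empty at the end of the run would need one final flush each.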

At the end, you'll have 65536 files, each holding around (one billion / 65536) ≈ 15258 250-byte keys. On each of these files you run a sort or a uniqueness check. With multiple cores, this again takes about the same time as reading the whole dataset a second time. This second part can also be offloaded to separate machines, each managing its own set of files.
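The per-file uniqueness check can then be as simple as sorting the fixed-width records and comparing neighbours (a sketch; it assumes one bucket file fits in memory, which the ~15258-key estimate above supports):

```python
KEY_SIZE = 250  # fixed-width records, as in the question

def duplicates_in_bucket(data: bytes) -> set:
    # Split one bucket file into fixed-size keys, sort them, and report
    # any key that equals its neighbour in the sorted order.
    keys = sorted(data[i:i + KEY_SIZE] for i in range(0, len(data), KEY_SIZE))
    return {a for a, b in zip(keys, keys[1:]) if a == b}
```

Keys flagged by any bucket are exactly the keys that occur more than once in the full dataset, since equal keys always land in the same bucket.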

Upvotes: 2
