user517196

Reputation: 43

Lookup Hash from LONG List of Hashes

I have a long text file that contains around 100 million MD5 hashes. I would like to hash a small set of files and find out if any of them have a hash value that is on the 100 million hash list. My 100 million hashes are alphabetically sorted. Without having to load the entire list into memory or into a database, what would be the most efficient way to look up hash values in this large text file? The hash list will be updated occasionally but will remain alphabetically sorted. I'm not interested in the location of a hit; what matters is whether or not there is a hit.

Upvotes: 4

Views: 604

Answers (3)

Thomas Pornin

Reputation: 74382

The critical parameter in that kind of job is the cost of an individual disk lookup. A disk lookup has an innate latency, because the read/write heads must be moved to the right position. On a typical disk, you can count on about a hundred lookups per second. On the other hand, disks are very good at sequential reading, so for each lookup, you can read, say, one megabyte worth of data, for little extra cost.

Here I assume that the "text file" has a regular format. For instance, each hash value uses exactly 33 bytes: 32 for the MD5 result itself (in hexadecimal) and 1 extra byte for a newline character. Adjust if needed, depending on the exact format. With these figures, your text file is about 3.3 GB long.

Since MD5 acts mostly like a random function, the 100 million hashes should be uniformly spread over the space of 128-bit values. This means that, given a hash value, you can compute the approximate position of that value in the file (if it is in the file). For instance, the hash value 9378ec093d09863d008154f1c8f5ca8f should be at an offset close to 0.5761*n*33, where n is the number of hashes in the big file and "33" is explained in the paragraph above. 0.5761 is the result of 0x9378EC divided by 0x1000000. Hence you can read one megabyte's worth of your text file, centered on that computed position. This will contain about 30000 hashes. The standard deviation for 100 million random values is on the order of 10000, so chances are high that those 30000 hashes will contain the right values to decide whether your hash is in the list or not. If the estimate is off, you will have to read another megabyte, but this will not happen often. You could read a bit more than a megabyte to make that occurrence rarer: there is a trade-off here, to be adjusted through actual measurements.

Once you have a (small) block of hash values in RAM, use a binary search. But the cost of the initial disk lookup will completely dwarf that part anyway.
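
A minimal sketch of this approach in Python, assuming the 33-byte record format described above and lowercase hexadecimal in the list; the file layout, the one-megabyte window and the fallback handling are placeholders to tune against real measurements:

    import bisect
    import os

    RECORD_SIZE = 33          # 32 hex characters + '\n'
    WINDOW = 1 << 20          # read roughly one megabyte around the estimate

    def probably_contains(path, md5_hex):
        md5_hex = md5_hex.lower()            # assumes lowercase hex in the list
        n_records = os.path.getsize(path) // RECORD_SIZE

        # Estimate the record index from the first 24 bits of the hash
        # (MD5 outputs are close to uniformly distributed).
        fraction = int(md5_hex[:6], 16) / 0x1000000
        center = int(fraction * n_records) * RECORD_SIZE

        # Read a window centered on the estimate, aligned to record boundaries.
        start = max(0, center - WINDOW // 2)
        start -= start % RECORD_SIZE
        with open(path, "rb") as f:
            f.seek(start)
            block = f.read(WINDOW)

        # Extract the complete 32-character entries from the window.
        records = [block[i:i + 32].decode("ascii")
                   for i in range(0, len(block) - 31, RECORD_SIZE)]

        # Binary search inside the in-memory block.
        if records and records[0] <= md5_hex <= records[-1]:
            i = bisect.bisect_left(records, md5_hex)
            return records[i] == md5_hex

        # Getting here means the estimate was off (rare with a window this
        # wide) or the window sits at an edge of the file; a complete
        # implementation would read an adjacent window or handle the edges.
        return False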

An alternate solution uses an extra index file. Build a secondary file which contains one out of every 10000 hashes in the big file. This file will be about 330 kB long. Keep it in RAM as much as possible. Use it (with a binary search) to find out which sequence of 10000 hashes is relevant for your lookup, then read that chunk from the big file. The index file must be rebuilt whenever the list of hashes changes; this is somewhat expensive, but less so than the change to the big file itself. Depending on the system which produces the big file, you may be able to integrate the index file generation at negligible extra cost.
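
Here is a rough sketch of the index-file variant in Python, again assuming 33-byte records; the file names and the 10000-record stride are illustrative and can be tuned:

    import bisect
    import os

    RECORD_SIZE = 33          # 32 hex characters + '\n'
    STRIDE = 10000            # keep one hash out of every 10000

    def build_index(big_path, index_path):
        # Copy one record out of every STRIDE to the index file
        # (about 330 kB for 100 million hashes).
        size = os.path.getsize(big_path)
        with open(big_path, "rb") as big, open(index_path, "wb") as idx:
            pos = 0
            while pos < size:
                big.seek(pos)
                idx.write(big.read(RECORD_SIZE))
                pos += STRIDE * RECORD_SIZE

    def load_index(index_path):
        # Load the small index into RAM as a sorted list of hex strings.
        with open(index_path, "rb") as idx:
            data = idx.read()
        return [data[i:i + 32].decode("ascii")
                for i in range(0, len(data), RECORD_SIZE)]

    def contains(big_path, index, md5_hex):
        md5_hex = md5_hex.lower()
        # Binary search the in-memory index to pick the relevant chunk.
        k = bisect.bisect_right(index, md5_hex) - 1
        if k < 0:
            return False      # smaller than the first hash in the list
        # One seek plus one sequential read of STRIDE records.
        with open(big_path, "rb") as big:
            big.seek(k * STRIDE * RECORD_SIZE)
            block = big.read(STRIDE * RECORD_SIZE)
        records = [block[i:i + 32].decode("ascii")
                   for i in range(0, len(block), RECORD_SIZE)]
        j = bisect.bisect_left(records, md5_hex)
        return j < len(records) and records[j] == md5_hex

Build the index once (and again whenever the big list changes), keep the loaded index around, and each lookup then costs a single seek plus one sequential read of a few hundred kilobytes.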

Upvotes: 4

Simone

Reputation: 11797

If they are sorted, for each hash in the small set you could look it up in the 100 million hash list with a binary search.

This is the most efficient way that comes to my mind, but if you don't want to store any values in memory you will have to use random access on the file.
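
For what it's worth, a minimal sketch of that binary search in Python, assuming fixed 33-byte records (32 hex characters plus a newline) and a lowercase-hex list; each probe costs one seek, so expect roughly log2(100 million) ≈ 27 seeks per lookup:

    import os

    RECORD_SIZE = 33          # 32 hex characters + '\n'

    def contains(path, md5_hex):
        md5_hex = md5_hex.lower()
        n = os.path.getsize(path) // RECORD_SIZE
        lo, hi = 0, n - 1
        with open(path, "rb") as f:
            while lo <= hi:
                mid = (lo + hi) // 2
                f.seek(mid * RECORD_SIZE)              # jump to the middle record
                rec = f.read(32).decode("ascii")       # read just the hex digits
                if rec == md5_hex:
                    return True
                if rec < md5_hex:
                    lo = mid + 1
                else:
                    hi = mid - 1
        return False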

Upvotes: 0

Ian

Reputation: 34489

I would imagine a binary search of the file would be quickest... You'd need to store the exact number of hashes in the file as a header first, though, so you know the limits of your search.

I've seen this done with large files, such as postcode information and it works a treat.

Upvotes: 2
