user31641

Reputation: 175

Search for a string from 100 million rows of strings

I have a text file containing md5 hashes, 100 million rows of them. I have another, smaller file with a few thousand md5 hashes. I want to find the corresponding indices of the md5 hashes from this new smaller file in the old bigger file.

What is the most efficient way to do it? Is it possible to do it in 15 minutes or so?

I have tried lots of things, but they do not work. First I tried to import the bigger data into a database file and create an index on the md5 hash column. Creating this index takes forever, and I am not even sure it will increase the query speed much. Suggestions?

Upvotes: 1

Views: 3069

Answers (4)

A. I. Breveleri

Reputation: 767

Assumptions:

(1) every record in the small file appears in the large file

(2) the data in each file is randomly ordered.

Options:

(1) For each record in the large file, search the small file linearly for a match. Since most searches will not find a match, the time will be close to Nlarge * Nsmall * k where k represents the time to attempt one match.

(2) For each record in the small file, search the large file linearly for a match. Since every search will find a match, the time will be about Nlarge/2 * Nsmall * k.

This looks twice as fast as option (1) -- but only if you can fit the large file completely into fast memory. You would probably need 6 GB of RAM.

(3) Sort the small file into an easily searchable form. A balanced binary tree is best, but a sorted array is almost as good. Or you could trust the author of some convenient hash table object to have paid attention in CS school. For each record in the large file, search the structured small file for a match. The time will be log2 Nsmall * s to sort the small file, where s represents the time to sort one record, plus log2 Nsmall * Nlarge * k for the scan. This gives a total time of log2 Nsmall * (s + Nlarge * k).

(4) Sort the large file into an easily searchable form. For each record in the small file, search the structured large file for a match. The time will be log2 Nlarge * s to sort the large file plus log2 Nlarge * Nsmall * k for the scan, giving a total of log2 Nlarge * (s + Nsmall * k).

Option (4) is obviously the fastest, as reducing any coefficient of Nlarge dominates all other improvements. But if the sortable structure derived from the large file will not fit completely into RAM, then option (3) might turn out to be faster.

(5) Sort the large file into an easily searchable form. Break this structure into pieces that will fit into your RAM. For each such piece, load the piece into RAM, then for each record in the small file, search the currently loaded piece for a match. The time will be log2 Nlarge * s to sort the large file plus log2 Nlarge * Nsmall * k * p for the scan, where the structure was broken into p pieces, giving a total of log2 Nlarge * (s + Nsmall * k * p).

With the values you indicated for Nlarge and Nsmall, and enough RAM so that p can be kept to a single digit, option (5) seems likely to be the fastest.
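
For concreteness, here is a minimal Python sketch of option (3): keep the small file as a sorted array and binary-search it for every record of the large file. The file names are placeholders, not part of the answer.

import bisect

# Option (3): structure the small file as a sorted array, then scan the large file once.
with open("small.txt") as f:                       # placeholder name for the small file
    small = sorted(line.strip() for line in f)

with open("big.txt") as f:                         # placeholder name for the large file
    for i, line in enumerate(f):
        h = line.strip()
        pos = bisect.bisect_left(small, h)         # log2(Nsmall) comparisons per record
        if pos < len(small) and small[pos] == h:
            print(i, h)                            # index in the large file, matching hash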

Upvotes: 0

Clément

Reputation: 12917

There are algorithms specifically designed for searching for multiple strings in a large file. One of them is Rabin-Karp. I have a blog post about this.

More simply, the following (the pseudo-code written out here as Python; the file names are placeholders) should get you there in no time:

wanted = set(line.strip() for line in open("small.txt"))   # your few thousand hashes
for i, line in enumerate(open("big.txt")):                 # scan the 100 million rows once
    if line.strip() in wanted:                             # near-instant set lookup
        print(i)

This will be very fast: the set data structure has almost-instant lookups, so the I/O will be the bottleneck, and 100 million hashes can be processed in 15 minutes without too much difficulty.

Upvotes: 0

Ebbe M. Pedersen

Reputation: 7488

Don't do this in a DB - use a simple program.

  1. Read the md5 hashes from the small file into a hash map in memory, which allows for fast look-ups.
  2. Then read through the md5s in the big file one row at a time, and check whether each row is in the hash map.

Average look-up time in the hash map ought to be close to O(1), so the processing time is basically determined by how fast you can read through the big file.
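
A minimal sketch of those two steps in Python (the file names and the printed output format are assumptions, not part of the answer):

# Step 1: load the small file's hashes into a dict (hash map), hash -> row found in the big file
with open("small.txt") as f:                      # placeholder name for the small file
    positions = {line.strip(): None for line in f}

# Step 2: stream the big file once and record the row number of every wanted hash
with open("big.txt") as f:                        # placeholder name for the big file
    for row, line in enumerate(f):
        h = line.strip()
        if h in positions and positions[h] is None:
            positions[h] = row                    # keep the first occurrence

for h, row in positions.items():
    print(h, row if row is not None else "not found")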

The 15 minutes is easily obtained with today's hardware with this approach.

Upvotes: 3

Eugen Rieck

Reputation: 65274

First of all: 100 million rows at 32 bytes each is about 3.2 GB of data. Reading that in 15 minutes translates to about 3.5 MB per second, which should easily be doable with modern hardware.

I recommend not using a database, but a process consisting of a few easy steps:

  1. Sort your data (the big file) - you have to do this only once, and you can parallelize much of it
  2. Read the small file into memory (sorted into an array)
  3. Cycle through this array:
  4. Read the big file line by line, comparing with the current element of your array (first compare the first byte, then the first and second, ...) until you either reach a match (output the index) or pass the value (output "not found"); see the sketch below
  5. Move to the next array element

The initial sort might easily take longer than 15 minutes, but the lookups should be quite fast: if you have enough RAM (and an OS that supports processes bigger than 2 GB), you should be able to get a compare rate at least an order of magnitude faster!
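
A rough Python sketch of that merge-style scan, assuming both files have already been sorted in the same plain byte order; the file names are placeholders, and the printed row numbers refer to positions in the sorted big file:

def merge_scan(big_path, small_path):
    # Both files must already be sorted in the same order.
    with open(small_path) as f:
        targets = [line.strip() for line in f]       # the few thousand hashes, sorted

    with open(big_path) as big:
        it = enumerate(big)
        row, line = next(it, (None, None))
        for t in targets:
            # Advance through the big file until we reach or pass the target hash.
            while line is not None and line.strip() < t:
                row, line = next(it, (None, None))
            if line is not None and line.strip() == t:
                print(t, row)                        # row number in the sorted big file
            else:
                print(t, "not found")

merge_scan("big_sorted.txt", "small_sorted.txt")     # placeholder file names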

Upvotes: 1
