Reputation: 73
I found variants of this idea, but none that could get me (very new to python) to where I need to be.
Here is the scenario:
hashfile.txt, consisting of unique strings, all on separate lines
an addresses.txt file
outfile.txt
My current code has been optimized to the best of my ability, but it only gets around 150 lines/second. Considering I have over 1.5 billion lines in my hashfile.txt, any optimization would help.
fin = 'hashed.txt'
nonzeros = open('addrOnly.txt', 'r')
fout = open('hits.txt', 'w')
lines = nonzeros.read()  # whole address file held as one big string
i = 0
count = 0
with open(fin, 'r') as f:
    for privkey in f:
        address = privkey.split(", ")[0]
        if address in lines:  # substring search over the whole address file
            fout.write(privkey)
        i = i+1
        if i%100 == 0:
            count = count + 100
            print "Passed: " + str(count)
Upvotes: 5
Views: 3950
Reputation: 40894
With data sizes like this, I'd use a proper database. Databases are optimized for fast handling of large datasets, far better than any Python program one might write.
Direct string comparisons are expensive. Let's hash the strings so that the full binary tree index of hashes has a very good chance of fitting in memory. MD5 is 128 bits and is very fast to compute.
First, compute the md5 of each record in either file, and store them in another text file:
from hashlib import md5

with open('hashfile.txt') as input:
    with open('hashfile-md5.txt', 'w') as output:
        for line in input:
            value = line.rstrip()  # cut '\n'
            output.write(value)
            output.write('\t')  # let our file be tab-separated
            output.write(str(int(md5(value).hexdigest(), 16)))  # md5 as long number
            output.write('\n')
Repeat the same for address.txt, generating address-md5.txt.
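Since both files go through exactly the same conversion, it may be convenient to wrap it in a small helper (just a sketch reusing the filenames above; write_md5_file is my own name, not anything from your code):

from hashlib import md5

def write_md5_file(src, dst):
    # Turn every line of src into "text<TAB>md5-as-decimal" in dst.
    with open(src) as infile, open(dst, 'w') as outfile:
        for line in infile:
            value = line.rstrip('\n')
            digest = int(md5(value.encode('utf-8')).hexdigest(), 16)
            outfile.write(value + '\t' + str(digest) + '\n')

write_md5_file('hashfile.txt', 'hashfile-md5.txt')
write_md5_file('address.txt', 'address-md5.txt')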
Take PostgreSQL, MySQL, or even SQLite (I'll use it here), and create two tables and one index.
$ sqlite3 matching-db.sqlite
create table hashfile (
txt varchar(64), -- adjust size to line lengths of hashfile.txt
hash number(38) -- enough to contain 128-bit hash
);
create table address (
txt varchar(64), -- adjust size to line lengths of address.txt
hash number(38) -- enough to contain 128-bit hash
);
Now load our data. Native database import is usually much faster than doing inserts from Python via dbapi.
.separator "\t"
.import hashfile-md5.txt hashfile
.import address-md5.txt address
Now we can create an index:
create index x_address_hash on address(hash);
This is a select statement that will efficiently scan the large hashfile table and look up matching hashes from the small address table. The index will be in RAM the whole time (hopefully), as will be most of the address table.
select h.txt
from hashfile h, address a
where h.hash = a.hash and h.txt = a.txt;
The idea is that the index x_address_hash will be used to efficiently match hashes, and if hashes match, the actual text values will be compared, too.
I did not try it on 29 MB of data, but on a toy 2-row example it worked :)
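If you'd rather run the final matching step from Python instead of the sqlite3 shell, a minimal sketch with the standard sqlite3 module (assuming the matching-db.sqlite file and the tables created above) would be:

import sqlite3

conn = sqlite3.connect('matching-db.sqlite')
with open('hits.txt', 'w') as fout:
    rows = conn.execute(
        'select h.txt from hashfile h, address a '
        'where h.hash = a.hash and h.txt = a.txt')
    for (txt,) in rows:          # each row is a 1-tuple with the matched line
        fout.write(txt + '\n')
conn.close()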
Upvotes: 6
Reputation: 36900
What you're looking to implement is probably the Rabin-Karp string search. It's highly efficient when you're searching for multiple strings simultaneously in some corpus.
More information about Python implementations is in this post: python efficient substring search
Since you're searching for multiple addresses at once, you would probably want to hash the entries in addresses.txt and compare them against the Rabin-Karp hash all at once, for each iteration. Read up more about the rolling hash in Rabin-Karp and you'll see how this works.
Rabin-Karp requires all patterns to be the same length; since in practice the addresses are probably all of some non-negligible length, you can truncate them to a common (not too short) length and hash just that prefix. You might also want to make the Rabin-Karp hash invariant to whitespace and small differences in how addresses are formatted, and define a matching string comparator to confirm candidate hits.
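For concreteness, here is a minimal sketch of that fixed-length, multi-pattern variant (my own illustration, not code from the question): it hashes a common-length prefix of every pattern once, slides a rolling hash over the text, and confirms each candidate with a real string comparison.

def rabin_karp_multi(text, patterns, base=256, mod=(1 << 61) - 1):
    m = min(len(p) for p in patterns)             # common prefix length
    targets = {}                                  # prefix hash -> patterns sharing it
    for p in patterns:
        h = 0
        for ch in p[:m]:
            h = (h * base + ord(ch)) % mod
        targets.setdefault(h, []).append(p)

    if len(text) < m:
        return []
    high = pow(base, m - 1, mod)                  # weight of the character that rolls out
    h = 0
    for ch in text[:m]:
        h = (h * base + ord(ch)) % mod
    hits = []
    for i in range(len(text) - m + 1):
        if h in targets:
            for p in targets[h]:
                if text[i:i + len(p)] == p:       # confirm; hashes can collide
                    hits.append((i, p))
        if i + m < len(text):                     # roll the hash one character forward
            h = ((h - ord(text[i]) * high) * base + ord(text[i + m])) % mod
    return hits

In your setting, text would be each line (or the whole contents) of hashfile.txt and patterns the entries of addresses.txt; whitespace normalization and the custom comparator mentioned above would hook in where the sketch does the direct slice comparison.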
Upvotes: 6