Reputation: 73
I found variants of this idea, but none that could get me (very new to python) to where I need to be.
Here is the scenario:
hashfile.txt, consisting of unique strings, all on separate lines
an addresses.txt file
outfile.txt
My current code has been optimized to the best of my ability, but it only gets around 150 lines/second. Considering I have over 1.5 billion lines in my hashfile.txt, any optimization would help.
fin = 'hashed.txt'
nonzeros = open('addrOnly.txt', 'r')
fout = open('hits.txt', 'w')
lines = nonzeros.read()  # whole address file held as one big string
i = 0
count = 0
with open(fin, 'r') as f:
    for privkey in f:
        address = privkey.split(", ")[0]
        if address in lines:  # substring search over the whole address file
            fout.write(privkey)
        i = i+1
        if i%100 == 0:
            count = count + 100
            print "Passed: " + str(count)
Upvotes: 5
Views: 3950
Reputation: 40894
With data sizes like this, I'd use a proper database. Databases are optimized for fast handling of large datasets, far better than any Python program one might write.
Direct string comparisons are expensive. Let's hash the strings so that the full binary tree index of hashes has a very good chance of fitting in memory. MD5 is 128 bits and is very fast to compute.
First, compute the md5 of each record in either file, and store them in another text file:
from hashlib import md5

with open('hashfile.txt') as input:
    with open('hashfile-md5.txt', 'w') as output:
        for line in input:
            value = line.rstrip()  # cut '\n'
            output.write(value)
            output.write('\t')  # let our file be tab-separated
            output.write(str(int(md5(value).hexdigest(), 16)))  # md5 as long number
            output.write('\n')
Repeat the same for address.txt, generating address-md5.txt.
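Since both files go through exactly the same conversion, it may be convenient to wrap it in a small helper (just a sketch reusing the filenames above; write_md5_file is my own name, not anything from your code):

from hashlib import md5

def write_md5_file(src, dst):
    # Turn every line of src into "text<TAB>md5-as-decimal" in dst.
    with open(src) as infile, open(dst, 'w') as outfile:
        for line in infile:
            value = line.rstrip('\n')
            digest = int(md5(value.encode('utf-8')).hexdigest(), 16)
            outfile.write(value + '\t' + str(digest) + '\n')

write_md5_file('hashfile.txt', 'hashfile-md5.txt')
write_md5_file('address.txt', 'address-md5.txt')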
Take PostgreSQL, MySQL, or even SQLite (I'll use it here), and create two tables and one index.
$ sqlite3 matching-db.sqlite
create table hashfile (
txt varchar(64), -- adjust size to line lengths of hashfile.txt
hash number(38) -- enough to contain 128-bit hash
);
create table address (
txt varchar(64), -- adjust size to line lengths of address.txt
hash number(38) -- enough to contain 128-bit hash
);
Now load our data. Native database import is usually much faster than doing inserts from Python via dbapi.
.separator "\t"
.import hashfile-md5.txt hashfile
.import address-md5.txt address
Now we can create an index:
create index x_address_hash on address(hash);
This is a select statement that will efficiently scan the large hashfile table and look up matching hashes from the small address table. The index will be in RAM the whole time (hopefully), as will be most of the address table.
select h.txt
from hashfile h, address a
where h.hash = a.hash and h.txt = a.txt;
The idea is that the index x_address_hash will be used to efficiently match hashes, and if hashes match, the actual text values will be compared, too.
I did not try it on 29 MB of data, but on a toy 2-row example it worked :)
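If you'd rather run the final matching step from Python instead of the sqlite3 shell, a minimal sketch with the standard sqlite3 module (assuming the matching-db.sqlite file and the tables created above) would be:

import sqlite3

conn = sqlite3.connect('matching-db.sqlite')
with open('hits.txt', 'w') as fout:
    rows = conn.execute(
        'select h.txt from hashfile h, address a '
        'where h.hash = a.hash and h.txt = a.txt')
    for (txt,) in rows:          # each row is a 1-tuple with the matched line
        fout.write(txt + '\n')
conn.close()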
Upvotes: 6
Reputation: 36900
What you're looking to implement is probably the Rabin-Karp string search. It's highly efficient when you're searching for multiple strings simultaneously in some corpus.
More information about Python implementations is in this post: python efficient substring search
Since you're searching for multiple addresses at once, you would probably want to hash the entries in addresses.txt and compare them against the Rabin-Karp hash all at once, for each iteration. Read up more about the rolling hash in Rabin-Karp and you'll see how this works.
Rabin-Karp requires all patterns to be the same length; since in practice the addresses are probably all of some non-negligible length, you can truncate them to a common (not too short) length and hash just that prefix. You might also want to make the Rabin-Karp hash invariant to whitespace and small differences in how addresses are formatted, and define a matching string comparator to confirm candidate hits.
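For concreteness, here is a minimal sketch of that fixed-length, multi-pattern variant (my own illustration, not code from the question): it hashes a common-length prefix of every pattern once, slides a rolling hash over the text, and confirms each candidate with a real string comparison.

def rabin_karp_multi(text, patterns, base=256, mod=(1 << 61) - 1):
    m = min(len(p) for p in patterns)             # common prefix length
    targets = {}                                  # prefix hash -> patterns sharing it
    for p in patterns:
        h = 0
        for ch in p[:m]:
            h = (h * base + ord(ch)) % mod
        targets.setdefault(h, []).append(p)

    if len(text) < m:
        return []
    high = pow(base, m - 1, mod)                  # weight of the character that rolls out
    h = 0
    for ch in text[:m]:
        h = (h * base + ord(ch)) % mod
    hits = []
    for i in range(len(text) - m + 1):
        if h in targets:
            for p in targets[h]:
                if text[i:i + len(p)] == p:       # confirm; hashes can collide
                    hits.append((i, p))
        if i + m < len(text):                     # roll the hash one character forward
            h = ((h - ord(text[i]) * high) * base + ord(text[i + m])) % mod
    return hits

In your setting, text would be each line (or the whole contents) of hashfile.txt and patterns the entries of addresses.txt; whitespace normalization and the custom comparator mentioned above would hook in where the sketch does the direct slice comparison.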
Upvotes: 6