Reputation: 392
I'm currently cleaning out our database and it's becoming very time-consuming. The typical
for email in emails:
loop is nowhere near fast enough.
For instance, I am currently comparing a list of 230,000 emails against a full records list of 39,000,000 lines. It would take hours to match each email to the record line it belongs to and print that line. Does anyone have any idea how to add threading to this lookup to speed it up? This, for example, is super fast:
    strings = ("string1", "string2", "string3")
    for line in file:
        if any(s in line for s in strings):
            print("yay!")
But it would never print the matching line, just the needle.
Thank you in advance.
Upvotes: 3
Views: 776
Reputation: 3612
Here's an example solution using threads. This code splits your data into equal chunks, one per declared thread, and passes each chunk as an argument to compare().
    strings = ("string1", "string2", "string3")
    lines = ['some random', 'lines with string3', 'and without it',
             '1234', 'string2', 'string1',
             'string1', 'abcd', 'xyz']

    def compare(x, thread_idx):
        # Scan one chunk of lines and report each line containing any of the strings
        print('Thread-{} started'.format(thread_idx))
        for line in x:
            if any(s in line for s in strings):
                print("We got one of strings in line: {}".format(line))
        print('Thread-{} finished'.format(thread_idx))
Threading part:
    from threading import Thread

    threads = []
    threads_amount = 3
    # Ceiling division, so trailing lines are not dropped when
    # len(lines) is not a multiple of threads_amount
    chunk_size = max(1, -(-len(lines) // threads_amount))
    for idx, start in enumerate(range(0, len(lines), chunk_size), 1):
        threads.append(Thread(target=compare, args=(lines[start:start + chunk_size], idx)))
        threads[-1].start()
    # Join every thread that was actually started
    for t in threads:
        t.join()
Output:
Thread-1 started
Thread-2 started
Thread-3 started
We got one of strings in line: string2
We got one of strings in line: string1
We got one of strings in line: string1
We got one of strings in line: lines with string3
Thread-2 finished
Thread-3 finished
Thread-1 finished
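If you need the matching lines back instead of printed output, the same chunking idea can be sketched with concurrent.futures.ThreadPoolExecutor from the standard library. This is only a sketch under assumptions: find_matches() and split() are hypothetical helper names, and the sample data is the same toy list as above. Also keep in mind that in CPython the global interpreter lock keeps threads from running this kind of pure-Python string matching in parallel, so threads alone may not give the speedup you are hoping for.

```python
from concurrent.futures import ThreadPoolExecutor

strings = ("string1", "string2", "string3")
lines = ['some random', 'lines with string3', 'and without it',
         '1234', 'string2', 'string1',
         'string1', 'abcd', 'xyz']

def find_matches(chunk):
    # Return the lines in this chunk that contain any of the needles
    return [line for line in chunk if any(s in line for s in strings)]

def split(seq, parts):
    # Ceiling division so no trailing items are dropped
    size = max(1, -(-len(seq) // parts))
    return [seq[i:i + size] for i in range(0, len(seq), size)]

with ThreadPoolExecutor(max_workers=3) as pool:
    # map() yields results in chunk order, so matches keep their original order
    matches = [line
               for chunk_matches in pool.map(find_matches, split(lines, 3))
               for line in chunk_matches]

print(matches)
```

Because the results are returned in chunk order, the matching lines come back in the same order as in the input, which the plain Thread version with print() does not guarantee.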
Upvotes: 1
Reputation: 12990
One possibility is to use a set to store the emails. This makes the membership check (word in emails) O(1) on average, so the work done is proportional to the total number of words in your file:
    emails = {"string1", "string2", "string3"}  # this is a set
    for line in f:  # f is the open records file
        if any(word in emails for word in line.split()):
            print("yay!")
Your original solution is O(nm) (for n words and m emails), as opposed to O(n) with the set.
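Since the question asks for the matching record line rather than a fixed message, here is a minimal sketch of the set approach that yields the line itself. It assumes (this is not stated in the question) that each record is a comma-separated line and that the email appears as a whole field; matching_lines() and the sample records are hypothetical.

```python
emails = {"alice@example.com", "bob@example.com"}  # hypothetical sample emails

records = [  # hypothetical sample records, assumed comma-separated
    "1,alice@example.com,Alice",
    "2,carol@example.com,Carol",
    "3,bob@example.com,Bob",
]

def matching_lines(lines, emails, sep=","):
    # Yield every line that has at least one field present in the emails set
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        if any(field.strip() in emails for field in fields):
            yield line

for line in matching_lines(records, emails):
    print(line)
```

With a real file you would pass the open file object instead of the records list, so the 39,000,000 lines are streamed one at a time and never held in memory together.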
Upvotes: 2