wuzz

Reputation: 392

Pythonic way to compare a list of words against a list of sentences and print the matching line

I'm currently cleaning out our database and it's becoming very time consuming. The typical

for email in emails:   

loop is nowhere near fast enough.

For instance, I am currently comparing a list of 230,000 emails against a full records list of 39,000,000 lines. It would take hours to match these emails to the record lines they belong to and print them. Does anyone have any idea how to implement threading in this search to speed it up? And although this is super fast:

strings = ("string1", "string2", "string3")
for line in file:
    if any(s in line for s in strings):
        print("yay!")

That would never print the matching line, just the needle.

Thank you in advance.

Upvotes: 3

Views: 776

Answers (2)

Filip Młynarski

Reputation: 3612

Here's an example solution using threads. This code splits your data into roughly equal chunks and passes each chunk as an argument to compare(), using as many threads as we declare in threads_amount.

strings = ("string1", "string2", "string3")
lines = ['some random', 'lines with string3', 'and without it',
         '1234', 'string2', 'string1',
         'string1', 'abcd', 'xyz']

def compare(x, thread_idx):
    print('Thread-{} started'.format(thread_idx))
    for line in x:
        if any(s in line for s in strings):
            print("We got one of strings in line: {}".format(line))
    print('Thread-{} finished'.format(thread_idx))

Threading part:

from threading import Thread

threads = []
threads_amount = 3
# Ceiling division, so trailing lines are not dropped when
# len(lines) is not a multiple of threads_amount
chunk_size = max(1, -(-len(lines) // threads_amount))

for idx, start in enumerate(range(0, len(lines), chunk_size), 1):
    threads.append(Thread(target=compare, args=(lines[start:start + chunk_size], idx)))
    threads[-1].start()

# Wait for every thread that was actually started
for t in threads:
    t.join()

Output:

Thread-1 started
Thread-2 started
Thread-3 started
We got one of strings in line: string2
We got one of strings in line: string1
We got one of strings in line: string1
We got one of strings in line: lines with string3
Thread-2 finished
Thread-3 finished
Thread-1 finished
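
If you want the matching lines collected instead of printed from inside each thread, here is a minimal sketch of the same chunking idea using concurrent.futures from the standard library. The helper names find_matches and split_into_chunks, and the extra 'tail string2' line, are made up for illustration. Keep in mind that in standard CPython builds the global interpreter lock generally prevents pure-Python, CPU-bound work like these substring checks from running in parallel, so this may not be faster than a plain loop; treat it as a structural example rather than a guaranteed speed-up.

```python
from concurrent.futures import ThreadPoolExecutor

def find_matches(chunk, needles):
    """Return the lines in chunk that contain at least one needle."""
    return [line for line in chunk if any(s in line for s in needles)]

def split_into_chunks(items, n_chunks):
    """Split items into at most n_chunks slices without dropping any item."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]

strings = ("string1", "string2", "string3")
# Ten lines on purpose: not evenly divisible by three workers
lines = ['some random', 'lines with string3', 'and without it',
         '1234', 'string2', 'string1',
         'string1', 'abcd', 'xyz', 'tail string2']

chunks = split_into_chunks(lines, 3)
with ThreadPoolExecutor(max_workers=3) as pool:
    # map() returns results in chunk order, so matches keep their original order
    results = list(pool.map(lambda c: find_matches(c, strings), chunks))

matches = [line for chunk_result in results for line in chunk_result]
for line in matches:
    print(line)
```

Returning the matches from each worker keeps the output in the original line order, which printing from several threads at once does not.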

Upvotes: 1

slider

Reputation: 12990

One possibility is to use a set to store the emails. This makes the check "word in emails" O(1) on average, so the work done is proportional to the total number of words in your file:

emails = {"string1", "string2", "string3"} # this is a set

for line in f:
    if any(word in emails for word in line.split()):
        print("yay!")

Your original solution is O(nm) (for n words and m emails), as opposed to O(n) with the set.
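
To print the matching line itself rather than "yay!", the set lookup can be combined with a step that pulls email-like tokens out of each line. This is only a sketch under assumptions: line.split() works when emails are separated by whitespace, so for records with other delimiters the example below uses a rough regular expression instead. That pattern, the lower-casing, the function name matching_lines, and the sample records are all illustrative guesses, since the real record format is not shown in the question.

```python
import re

# Assumed pattern: a rough email matcher for illustration,
# not a complete validator of real-world addresses.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

def matching_lines(lines, emails):
    """Yield each line that contains at least one address from emails."""
    wanted = {e.lower() for e in emails}  # set gives O(1) average lookups
    for line in lines:
        found = EMAIL_RE.findall(line.lower())
        if any(addr in wanted for addr in found):
            yield line

# Hypothetical sample data; the real inputs would be the email list
# and an open file object for the records.
emails = ["alice@example.com", "bob@example.org"]
records = [
    "1,Alice,alice@example.com,London",
    "2,Carol,carol@example.net,Paris",
    "3,Bob,BOB@example.org,Berlin",
]

result = list(matching_lines(records, emails))
for line in result:
    print(line)
```

Because matching_lines is a generator, passing it an open file processes one line at a time, so the 39,000,000 lines never have to be held in memory at once.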

Upvotes: 2
