boot-scootin

Reputation: 12515

Parallelism inside of a function?

I have a function that counts how many of the rows below contain every item in a given list:

def count(pair_list):
    return float(sum([1 for row in rows if all(item in row.split() for item in pair_list)]))

if __name__ == "__main__":
    pairs = [['apple', 'banana'], ['cookie', 'popsicle'], ['candy', 'cookie'], ...]
    # grocery transaction data
    rows = ['apple cookie banana popsicle wafer', 'almond milk eggs butter bread', 'bread almonds apple', 'cookie candy popsicle pop', ...]

    res = [count(pair) for pair in pairs]

In reality, len(rows) is 10000 and there are 18000 elements in pairs, so the list comprehension in count() and the one in the main function are expensive to compute.

I tried some parallel processing:

from multiprocessing.dummy import Pool as ThreadPool
import multiprocessing as mp

threadpool = ThreadPool(processes=mp.cpu_count())

res = threadpool.map(count, pairs)

This doesn't run quickly, either. In fact, after 15 minutes I just killed the job because it didn't look to be ending. Two questions: 1) how can I speed up the actual searching that takes place in count()? 2) how can I check the status of the threadpool.map process (i.e. see how many pairs are left to iterate over)?

Upvotes: 1

Views: 62

Answers (1)

Alexander Anikin

Reputation: 1098

1) The overall cost of the calculation is enormous, and it comes from several sources:

a) You split each row at the innermost level of the calculation, so Python has to build a fresh split of the row on every iteration. To avoid this, pre-compute the splits once. Something like this will do the job (with minor changes in the count function; one possible version is sketched below):

rows2 = [row.split() for row in rows]
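For illustration, the "minor changes" could look like this (a sketch assuming rows2 is defined at module level, as above):

def count(pair_list):
    # rows2 entries are already lists of words, so no repeated .split() calls
    return float(sum(1 for row in rows2 if all(item in row for item in pair_list)))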

b) You compare list items one by one, even though you only need to check whether a word exists in a collection. A set makes that membership test O(1) instead of a linear scan (use rows3 instead of rows2 in the count function):

rows3 = [set(row.split()) for row in rows]

def count(pair_list):
    # set membership makes each item check O(1)
    return float(sum(1 for row in rows3 if all(item in row for item in pair_list)))

c) You check every word in pairs against every word in rows. The original version takes about 2*len(row)*len(rows) iterations per call of the count function, while option b) gets that down to roughly 2*len(rows) set lookups in the good case. But it's possible to make one set lookup per pair, not two: precompute, for every row, the set of all possible word*word combinations, and then check whether the corresponding tuple exists in that set. So, in the main function you create this larger immutable search structure:

rows4 = [set((a, b) for a in row for b in row) for row in rows2]
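For example, for the row 'apple banana', this set holds ('apple', 'apple'), ('apple', 'banana'), ('banana', 'apple') and ('banana', 'banana'), so a pair is found regardless of word order.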

And now count will look different: it takes a tuple instead of a list:

def count2(pair):
    return float(sum(1 for row in rows4 if pair in row))

So you call it a bit differently: res = [count2(tuple(pair)) for pair in pairs]

Note that building the search structure takes len(row.split())^2 time and space per row, so if your rows can be long, it's not optimal and option b) may turn out better after all.
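To make option c) concrete, here is a small self-contained sketch putting the pieces together (the toy data is mine, chosen only for illustration):

rows = ['apple cookie banana popsicle wafer', 'cookie candy popsicle pop']
pairs = [['apple', 'banana'], ['candy', 'cookie']]

rows2 = [row.split() for row in rows]  # a) split each row once
rows4 = [set((a, b) for a in row for b in row) for row in rows2]  # c) all word pairs per row

def count2(pair):
    # one set lookup per row instead of per-word scans
    return float(sum(1 for row in rows4 if pair in row))

res = [count2(tuple(pair)) for pair in pairs]
print(res)  # [1.0, 1.0]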

2) You can predict the number of calls to count: it's len(pairs). Count the calls inside the count function and print a debug line for, say, every 1000th call; a sketch follows below.
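A minimal sketch of such a progress counter, assuming count is called from threads in the same process (multiprocessing.dummy uses threads, so a shared module-level counter is visible to all workers; the names and the interval are mine):

import itertools

_calls = itertools.count(1)  # atomic enough in CPython for rough progress output

def count(pair_list):
    n = next(_calls)
    if n % 1000 == 0:
        print('%d of %d pairs processed' % (n, len(pairs)))
    return float(sum(1 for row in rows3 if all(item in row for item in pair_list)))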

Upvotes: 1
