How to share a variable (not for modification) in Python using multithreading?

Question

I'm trying to parallelize a function I wrote for a sequential program. Below are the input and output.

Input 1, a list of strings: ["foo bar los angles", "foo bar new york", ...]

Input 2, a list of strings used as a dictionary: ["los angles", "new york", ...]

I want to remove every string in input 2 from the strings in input 1, so the output looks like:

["foo bar", "foo bar"].

I'm able to do it using a double for loop.

res = []
for s1 in input1:
    for s2 in input2:
        if s2 in s1:
            res.append(s1.replace(s2, ""))

But this runs slowly (more than 10 minutes on my MacBook Pro) when input1 has 2 million items (input 2 has a couple of thousand).

I found a way to use Python's multiprocessing.dummy.Pool, with pool.map and a global variable, to parallelize it. But I'm concerned about the use of the global variable. Is it safe to do so? Is there a better way for Python threads to share a variable (maybe something like Apache Spark's mapPartitions)?

I'm using Python 2.7 now, so I'd prefer an answer that uses Python 2.

rrauenza · Accepted Answer

It's generally recommended to avoid multithreading for CPU-bound work when you want performance, because of the GIL. Luckily we have multiprocessing!

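#!/usr/bin/python
import itertools
import multiprocessing

in1 = ["foo bar los angles", "foo bar new york",]
in2 = ["los angles", "new york",]

results = []

def sub(arg):
    # arg is one (s1, s2) pair from the cartesian product.
    s1, s2 = arg
    if s2 in s1:
        return s1.replace(s2, "")
    # No match: fall through and return None.

if __name__ == "__main__":
    # The guard keeps worker processes from re-running this block on
    # platforms that start workers by importing the main module.
    pool = multiprocessing.Pool(4)
    for result in pool.imap(sub, itertools.product(in1, in2)):
        if result is not None:
            results.append(result)
    pool.close()
    pool.join()

    print(results)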

It sounds like your 2 million item list is already in memory, so you'll want to use imap rather than map; map would first turn the product into a list of several billion pairs (2 million times a couple of thousand). I also use itertools.product to form the cartesian product of your inputs -- which is what your nested loop was doing.
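To make the itertools.product point concrete, here is a small sketch using the two sample lists from the question. It only checks that the product yields the same pairs, in the same order, as the nested loops, and that the pairs come out one at a time. The names nested, pairs, first and rest are just for illustration. It should run unchanged on Python 2.7 and 3.

```python
import itertools

in1 = ["foo bar los angles", "foo bar new york"]
in2 = ["los angles", "new york"]

# The pairs visited by the nested for loops from the question.
nested = [(s1, s2) for s1 in in1 for s2 in in2]

# product() returns an iterator. It keeps copies of the two inputs,
# but each (s1, s2) pair is only built when it is requested.
pairs = itertools.product(in1, in2)
first = next(pairs)   # pull a single pair
rest = list(pairs)    # drain the remainder (fine here, the sample is tiny)

print(first)
print([first] + rest == nested)
```

With the real data you would never call list() on the product; you would hand the iterator straight to imap as in the answer.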

Your requirements were a little vague in terms of uniqueness -- you were only adding to the results if you found a match.

Since we only add to results in the main process, there is no need to worry about the global results variable. If you were using multithreading, your map function could write directly to the results variable because of the GIL's protection, but your concurrency would also suffer from the GIL.
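For comparison, this is roughly what the threaded variant would look like. It is a sketch, not a recommendation: it uses multiprocessing.dummy.Pool (a thread pool with the same API as Pool), every thread only reads in2, and it relies on list.append being atomic in CPython, which is an implementation detail rather than a language guarantee. The function name strip_matches is made up for the example, and the order of results depends on thread scheduling.

```python
from multiprocessing.dummy import Pool  # thread-backed Pool, same API

in1 = ["foo bar los angles", "foo bar new york"]
in2 = ["los angles", "new york"]   # shared, read-only: no locking needed

results = []                        # shared, written by every thread

def strip_matches(s1):
    # Reading in2 from several threads is safe because nobody mutates it.
    for s2 in in2:
        if s2 in s1:
            # Assumption: CPython, where list.append is atomic under the GIL.
            results.append(s1.replace(s2, ""))

pool = Pool(4)
pool.map(strip_matches, in1)
pool.close()
pool.join()

print(sorted(results))
```

Since the work is pure string processing, the threads mostly take turns holding the GIL, so don't expect this to be much faster than the sequential loop.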

Note that you can tune imap by passing a larger chunksize. You can optimize further by relaxing the ordering requirement with imap_unordered. See the multiprocessing documentation for more information.
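A sketch of both tweaks together, assuming the same sub function as in the script. The wrapper name remove_matches and the default values (4 processes, chunksize of 1000) are placeholders I picked for illustration, not tuned numbers; you'd have to measure on your own data.

```python
import itertools
import multiprocessing

def sub(pair):
    # pair is one (s1, s2) tuple from the cartesian product.
    s1, s2 = pair
    if s2 in s1:
        return s1.replace(s2, "")
    return None

def remove_matches(in1, in2, processes=4, chunksize=1000):
    # chunksize batches many pairs into each task sent to a worker, which
    # cuts inter-process overhead; imap_unordered yields results as soon
    # as they are ready instead of in input order.
    pool = multiprocessing.Pool(processes)
    try:
        pairs = itertools.product(in1, in2)
        return [r for r in pool.imap_unordered(sub, pairs, chunksize=chunksize)
                if r is not None]
    finally:
        pool.close()
        pool.join()

if __name__ == "__main__":
    out = remove_matches(["foo bar los angles", "foo bar new york"],
                         ["los angles", "new york"],
                         processes=2, chunksize=2)
    print(sorted(out))  # sorted because the arrival order is not fixed
```

The trade-off with imap_unordered is that results no longer line up with the input order, so only use it if you don't care which output came from which input.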
