Reputation: 1428
I'm trying to parallelize a function I wrote for sequential program. Below is the input and output
Input 1, list of string : ["foo bar los angles", "foo bar new york", ...]
Input 2, list of string as dictionary: ["los angles", "new york"..]
I want to remove all string in input 2 from input 1. So the output will be like:
["foo bar", "foo bar"].
I'm able to do it using a double for loop.
res = []
for s1 in input1:
for s2 in input2:
if s2 in s1:
res.append(s1.replace(s2, ""))
But this run a little slow (more than 10 minutes on my macbook pro) on 2 million size of list input1 (input 2 is couple of thousands).
I found a way to use python's multithreading.dummy.Pool
. And use pool.map
along with a global variable to parallelize it. But I'm concern about the usage of global variable. Is it safe to do so? Is there a better way to for python multithread to share a variable (May be like apache spark's mapPartions
)?
I'm using Python 2.7 now. So I'd prefer answer use python2.
Upvotes: 2
Views: 91
Reputation: 6993
It's generally recommended to avoid multithreading when wanting performance because of the GIL. Luckily we have multiprocessing!
#!/usr/bin/python
import itertools
import multiprocessing
in1 = ["foo bar los angles", "foo bar new york",]
in2 = ["los angles", "new york",]
results = []
def sub(arg):
s1, s2 = arg
if s2 in s1:
return s1.replace(s2, "")
pool = multiprocessing.Pool(4)
for result in pool.imap(sub, itertools.product(in1, in2)):
if result is not None:
results.append(result)
print results
It sounds like your 2 million item list is already in memory, so you'll want to use imap
not map
in order to keep from turning the product into a thousands of millions item list. I also use itertools.product
to do the cartesian product of your inputs -- which is what your nested loop was doing.
Your requirements were a little vague in terms of uniqueness -- you were only adding to the results if you found a match.
Since we only add to results
in the main body, there is no need to worry about the global results
variable. If you were using multithreading
your map
function could write directly to the results variable because of the GIL's protection ....but your concurrency would also be suffering from the GIL as well.
Note you can tune the imap
by passing a large chunksize
. You can optimize further by relaxing the ordered requirement by using imap_unordered
. See multiprocessing for more information.
Upvotes: 2