Reputation: 1418
I have tested multiprocessing and threading in Python, but multiprocessing is slower than threading. I calculate a distance using the editdistance package. My code looks like:
import time
import editdistance
from multiprocessing import Pool

def calc_dist(kw, trie_word):
    dists = []
    while len(trie_word) != 0:
        w = trie_word.pop()
        dist = editdistance.eval(kw, w)
        dists.append((w, dist))
    return dists

if __name__ == "__main__":
    word_list = [str(i) for i in range(1, 10000001)]
    key_word = '2'
    print("calc")
    s = time.time()
    with Pool(processes=4) as pool:
        result = pool.apply_async(calc_dist, (key_word, word_list))
        print(len(result.get()))
    print("time", time.time() - s)
Using threading:
import threading

class DistThread(threading.Thread):
    def __init__(self, func, args):
        super(DistThread, self).__init__()
        self.func = func
        self.args = args
        self.dists = None

    def run(self):
        self.dists = self.func(*self.args)

    def join(self):
        # Thread.join() takes no thread argument; passing self here was a bug.
        super().join()
        return self.dists
On my computer, the multiprocessing version takes about 118 s, but the threading version takes about 36 s. What is wrong with it?
Upvotes: 0
Views: 911
Reputation:
I had a similar experience with threading and multiprocessing in Python when consuming CSVs that held a large amount of data. I looked into this a little and found that multiprocessing spawns multiple processes to perform tasks, which can be slower than just running one threaded process, since threading runs in one place. There is a more definitive answer here: Multiprocessing vs Threading Python.
Pasting the answer from the link in case the link disappears:
The threading module uses threads, the multiprocessing module uses processes. The difference is that threads run in the same memory space, while processes have separate memory. This makes it a bit harder to share objects between processes with multiprocessing. Since threads use the same memory, precautions have to be taken or two threads will write to the same memory at the same time. This is what the global interpreter lock is for.
Spawning processes is a bit slower than spawning threads. Once they are running, there is not much difference.
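To get a rough feel for that startup cost, here is a small sketch of mine (not from the linked answer) that times starting and joining threads versus processes; absolute numbers will vary by machine and start method, but the process loop should come out noticeably slower:

import time
import threading
from multiprocessing import Process

def noop():
    pass

if __name__ == "__main__":
    # Start and join 20 threads, then 20 processes, timing each batch.
    s = time.time()
    for _ in range(20):
        t = threading.Thread(target=noop)
        t.start()
        t.join()
    print("threads:  ", time.time() - s)

    s = time.time()
    for _ in range(20):
        p = Process(target=noop)
        p.start()
        p.join()
    print("processes:", time.time() - s)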
Upvotes: 0
Reputation: 16174
A couple of issues:

- A significant amount of time will be spent serialising the data so it can be sent to the other process, while threads share the same address space and so can just use pointers (see the pickling sketch below for a feel of this cost).
- Your current code is only using one process to do all the calculations with multiprocessing. You need to separate your array into "chunks" somehow so that it can be processed by multiple workers.
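Since multiprocessing ships arguments to workers by pickling them, you can estimate the serialisation cost from the first point by timing pickle on the same list. A sketch of mine, not part of the original answer:

import pickle
import time

# The same 10-million-string list as in the question.
word_list = [str(i) for i in range(1, 10000001)]

s = time.time()
data = pickle.dumps(word_list)
print("pickled", len(data), "bytes in", round(time.time() - s, 2), "seconds")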
For the second point, e.g.:
import time
from multiprocessing import Pool
import editdistance

def calc_one(trie_word):
    return editdistance.eval(key_word, trie_word)

if __name__ == "__main__":
    word_list = [str(i) for i in range(1, 10000001)]
    key_word = '2'
    print("calc")
    s = time.time()
    with Pool(processes=4) as pool:
        result = pool.map(calc_one, word_list, chunksize=10000)
    print(len(result))
    print("time", time.time() - s)

    s = time.time()
    result = list(calc_one(w) for w in word_list)
    print(len(result))
    print("time", time.time() - s)
This relies on key_word being a global variable. For me, the version using multiple processes takes ~5.3 seconds while the second version takes ~16.9 seconds. Not 4 times as quick, as the data still needs to be sent back and forth, but pretty good.
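If you would rather not rely on key_word being a global (this only works with the fork start method; with spawn, workers re-import the module and never execute the __main__ block that defines it), you can pass it explicitly with functools.partial. A sketch along the same lines, not from the original answer:

import time
from functools import partial
from multiprocessing import Pool
import editdistance

def calc_one(kw, trie_word):
    # The keyword is now an explicit argument instead of a global.
    return editdistance.eval(kw, trie_word)

if __name__ == "__main__":
    word_list = [str(i) for i in range(1, 10000001)]
    key_word = '2'
    s = time.time()
    with Pool(processes=4) as pool:
        # partial objects pickle cleanly, so each worker receives
        # the keyword along with its chunk of words.
        result = pool.map(partial(calc_one, key_word), word_list, chunksize=10000)
    print(len(result))
    print("time", time.time() - s)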
Upvotes: 3