Reputation: 460
I have a function that I'd like to parallelize.
import re
import multiprocessing as mp
from pathos.multiprocessing import ProcessingPool as Pool

cores = mp.cpu_count()
# create the multiprocessing pool
pool = Pool(cores)

def clean_preprocess(text):
    """
    Given a string of text, the function:
    1. Removes all punctuation and numbers and converts the text to lower case
    2. Handles the negation words defined above
    3. Tokenizes, keeping only words longer than one character
    """
    cores = mp.cpu_count()
    pool = Pool(cores)
    lower = re.sub(r"[^a-zA-Z\s\']", "", text).lower()
    lower_neg_handled = n_pattern.sub(lambda x: n_dict[x.group()], lower)
    letters_only = re.sub(r"[^a-zA-Z\s]", "", lower_neg_handled)
    words = [i for i in tok.tokenize(letters_only) if len(i) > 1]  # parallelize this?
    return ' '.join(words)
I have been reading the documentation on multiprocessing but am still a little confused about how to parallelize my function appropriately. I would be grateful if somebody could point me in the right direction for parallelizing a function like mine.
Upvotes: 1
Views: 67
Reputation: 1967
For your function, you could parallelize by splitting the text into sub-parts, applying the tokenization to each sub-part, then joining the results.
Something along the line of:
text0 = text[:len(text) // 2]
text1 = text[len(text) // 2:]
Then apply your processing to these two parts, using:
# here, I suppose that clean_preprocess is the sequential version,
# and we manage the pool outside of it
with Pool(2) as p:
    words0, words1 = p.map(clean_preprocess, [text0, text1])
words = words0 + words1
# or continue with words0 and words1 to save the cost of joining the lists
However, your function seems memory bound, so it won't see a dramatic speedup (typically a factor of 2 is the most we can hope for on standard computers these days); see e.g. How much does parallelization help the performance if the program is memory-bound? or What do the terms "CPU bound" and "I/O bound" mean?
So you could try splitting the text into more than two parts, but you may not get any faster. You could even see worse performance, because splitting the text could cost more than processing it.
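Putting the two steps together, here is a minimal self-contained sketch. The clean_preprocess below is a simplified stand-in for the sequential function in the question (the negation handling and the tok tokenizer are omitted, since they are defined elsewhere in your script), and the split is done on whitespace boundaries so that no word is cut in half:

```python
import re
from multiprocessing import Pool


def clean_preprocess(text):
    # Simplified sequential version: strip non-letters, lower-case,
    # and keep only tokens longer than one character.
    letters_only = re.sub(r'[^a-zA-Z\s]', '', text).lower()
    words = [w for w in letters_only.split() if len(w) > 1]
    return ' '.join(words)


def parallel_preprocess(text, n_parts=2):
    # Split on word boundaries rather than raw character offsets,
    # so the chunking never cuts a word in half.
    words = text.split()
    chunk = max(1, len(words) // n_parts)
    parts = [' '.join(words[i:i + chunk])
             for i in range(0, len(words), chunk)]
    with Pool(n_parts) as p:
        cleaned = p.map(clean_preprocess, parts)
    return ' '.join(cleaned)


if __name__ == '__main__':
    print(parallel_preprocess("Some EXAMPLE text, with 123 numbers!"))
```

Note that each worker only sees its own chunk, so any state the real function needs (n_pattern, n_dict, tok) must be importable at module level in every process.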
Upvotes: 1