Reputation: 460
I have a function that I'd like to parallelize.
import re
import multiprocessing as mp
from pathos.multiprocessing import ProcessingPool as Pool

cores = mp.cpu_count()
# create the multiprocessing pool
pool = Pool(cores)

def clean_preprocess(text):
    """
    Given a string of text, the function:
    1. Removes all punctuation and numbers and converts the text to lower case
    2. Handles the negation words defined above
    3. Tokenizes, keeping only words longer than one character
    """
    cores = mp.cpu_count()
    pool = Pool(cores)
    lower = re.sub(r"[^a-zA-Z\s\']", "", text).lower()
    lower_neg_handled = n_pattern.sub(lambda x: n_dict[x.group()], lower)
    letters_only = re.sub(r"[^a-zA-Z\s]", "", lower_neg_handled)
    words = [i for i in tok.tokenize(letters_only) if len(i) > 1]  # parallelize this?
    return ' '.join(words)
I have been reading the documentation on multiprocessing but am still a little confused about how to parallelize my function appropriately. I would be grateful if somebody could point me in the right direction for parallelizing a function like mine.
Upvotes: 1
Views: 67
Reputation: 1967
For your function, you could parallelize by splitting the text into sub-parts, applying the tokenization to each sub-part, then joining the results.
Something along the line of:
text0 = text[:len(text) // 2]
text1 = text[len(text) // 2:]
Then apply your processing to these two parts, using:
# here, I suppose that clean_preprocess is the sequential version,
# and we manage the pool outside of it
with Pool(2) as p:
    words0, words1 = p.map(clean_preprocess, [text0, text1])
words = words0 + words1
# or continue with words0 and words1 to save the cost of joining the lists
However, your function seems memory bound, so it won't see a dramatic speedup (typically a factor of 2 is the most we can hope for on standard computers these days); see e.g. How much does parallelization help the performance if the program is memory-bound? or What do the terms "CPU bound" and "I/O bound" mean?
So you could try splitting the text into more than two parts, but you may not get any faster. You could even see worse performance, because splitting the text could cost more than processing it.
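Putting the two steps together, here is a minimal self-contained sketch. The clean_preprocess below is a simplified stand-in for the sequential function in the question (the negation handling and the tok tokenizer are omitted, since they are defined elsewhere in your script), and the split is done on whitespace boundaries so that no word is cut in half:

```python
import re
from multiprocessing import Pool


def clean_preprocess(text):
    # Simplified sequential version: strip non-letters, lower-case,
    # and keep only tokens longer than one character.
    letters_only = re.sub(r'[^a-zA-Z\s]', '', text).lower()
    words = [w for w in letters_only.split() if len(w) > 1]
    return ' '.join(words)


def parallel_preprocess(text, n_parts=2):
    # Split on word boundaries rather than raw character offsets,
    # so the chunking never cuts a word in half.
    words = text.split()
    chunk = max(1, len(words) // n_parts)
    parts = [' '.join(words[i:i + chunk])
             for i in range(0, len(words), chunk)]
    with Pool(n_parts) as p:
        cleaned = p.map(clean_preprocess, parts)
    return ' '.join(cleaned)


if __name__ == '__main__':
    print(parallel_preprocess("Some EXAMPLE text, with 123 numbers!"))
```

Note that each worker only sees its own chunk, so any state the real function needs (n_pattern, n_dict, tok) must be importable at module level in every process.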
Upvotes: 1