Parallelism by multiprocessing is barely reducing time

Question

I used this and this to run 2 function calls in parallel, but the times are barely improving. This is my code:

Sequential:

from nltk import pos_tag

def posify(txt):
    return ' '.join([pair[1] for pair in pos_tag(txt.split())])

df1['pos'] = df1['txt'].apply(posify)  # ~15 seconds
df2['pos'] = df2['txt'].apply(posify)  # ~15 seconds
# Total Time: 30 seconds

Parallel:

from nltk import pos_tag
import multiprocessing

def posify(txt):
    return ' '.join([pair[1] for pair in pos_tag(txt.split())])

def posify_parallel(ser, key_name, shared_dict):
    shared_dict[key_name] = ser.apply(posify)

manager = multiprocessing.Manager()
return_dict = manager.dict()
p1 = multiprocessing.Process(target=posify_parallel, args=(df1['txt'], 'df1', return_dict))
p1.start()
p2 = multiprocessing.Process(target=posify_parallel, args=(df2['txt'], 'df2', return_dict))
p2.start()
p1.join(), p2.join()
df1['pos'] = return_dict['df1']
df2['pos'] = return_dict['df2']
# Total Time: 27 seconds

I would expect the total time to be about 15 seconds, but I'm getting 27 seconds.
If it makes any difference, I have an i7 2.6GHz CPU with 6 cores (12 logical).

Is it possible to achieve something around 15 seconds? Does this have something to do with the pos_tag function itself?

EDIT:

I ended up just doing the following and now it's 15 seconds:

with Pool(cpu_count()) as pool:
    df1['pos'] = pool.map(posify, df1['txt'])
    df2['pos'] = pool.map(posify, df2['txt'])

I think this way the lines run sequentially, but each of them runs in parallel internally. As long as it's 15 seconds, that's fine with me.

Parallelism by multiprocessing is barely reducing time

Answers (1)

Related Questions