pandas and parallel computations and external library

Question

Here is my code:

import pandas as pd
from nltk.corpus import wordnet

df = pd.DataFrame({'col_1': ['desk', 'apple', 'run']})
df['synset'] = df.col_1.apply(lambda x: wordnet.synsets(x))

The code above runs fairly slowly on a 4-core PC with 16 GB of RAM. I hoped to speed it up by running it on a Google Cloud instance with 24 cores and 120 GB of RAM, but it was still slow (maybe twice as fast as before), and the Google console showed that only 4.1 cores were in use.

So I am curious: does pandas run the computation for each row in parallel? If it does, then I am guessing nltk is the bottleneck here. Can anybody confirm or correct my guesses?

P.S. The code above is just a sample; the real dataframe has 100k rows.

baloo · Accepted Answer

pandas does not parallelize apply; it calls your function once per row in a single process. Define a named, top-level function that does the per-row work instead of the lambda, use multiprocessing to run it across worker processes, and then write the results back into your dataframe.

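from multiprocessing import Pool

def my_func(i):
    # some work with i as index; must be a top-level function so workers can import it
    result = None  # placeholder: compute the value for row i here
    return (i, result)

if __name__ == '__main__':
    with Pool(24) as pool:
        # imap yields (index, result) pairs in input order
        res = dict(pool.imap(my_func, df.index))
    # DataFrame.set_value was removed in pandas 1.0; assigning a Series
    # keyed by index label aligns the results with df instead
    df['New column'] = pd.Series(res)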
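
If it helps, here is a minimal, self-contained sketch of the same idea that maps over the column values rather than the index. The names lookup and parallel_map are made up for illustration, and lookup is only a stand-in for the real per-word work (in the question that would be wordnet.synsets(word)). The chunksize default is a guess, not a tuned value.

```python
from multiprocessing import Pool


def lookup(word):
    # Hypothetical stand-in for the real per-word work, e.g.
    # wordnet.synsets(word). It is defined at module top level so that
    # worker processes can import it by name.
    return word.upper()


def parallel_map(func, items, processes=4, chunksize=1000):
    """Apply func to every item and return the results in input order."""
    items = list(items)
    if processes <= 1:
        # Serial fallback, handy for debugging and timing comparisons.
        return [func(item) for item in items]
    with Pool(processes) as pool:
        # Pool.map preserves input order; chunksize batches items per task
        # to cut down on inter-process communication overhead.
        return pool.map(func, items, chunksize=chunksize)


if __name__ == "__main__":
    # The guard keeps worker processes from re-running this block on
    # platforms that start workers by re-importing the main module.
    words = ["desk", "apple", "run"]
    print(parallel_map(lookup, words, processes=2))
```

With the dataframe from the question you would call it as df['synset'] = parallel_map(lookup, df['col_1'].tolist(), processes=24), inside the __main__ guard. Arguments and return values are pickled when they move between processes, so if the real results turn out not to pickle cleanly, returning plain strings from lookup is the safer choice. Whether 24 processes actually gives a near-linear speedup depends on how expensive each call is relative to that pickling overhead, so it is worth timing against processes=1 on a sample first.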
