Reputation: 7730
Here is my code:
import pandas as pd
from nltk.corpus import wordnet
df = pd.DataFrame({'col_1': ['desk', 'apple', 'run']})
df['synset'] = df.col_1.apply(lambda x: wordnet.synsets(x))
The above code runs fairly slow on 4 core pc with 16 GB ram. I was hoping to speed up and run it on Google Cloud instance with 24 cores and 120 GB ram. And still was running slow (maybe twice as fast as before). And Google Console was showing that only 4.1 cores are utilized.
So I am curios: does Pandas runs computations for each row in parallel? If it does, then I am guessing nltk
is a bottleneck here. Can anybody confirm or correct my guesses?
P.S. The above code is just a sample, real dataframe has 100k rows.
Upvotes: 0
Views: 166
Reputation: 527
pandas does not parallelize apply. You should define a custom function that runs on each row instead of your lambda function, then use multiprocessing
to work on that and resync it with your dataframe.
def my_func(i):
#some work with i as index
return (i,result)
from multiprocessing import Pool
pool = Pool(24)
res=pool.imap(my_func,df.index)
for t in res:
df.set_value(t[0],"New column",t[1])
Upvotes: 1