Hari

Reputation: 193

How do I improve performance in parallel computing with Dask?

I have a pandas DataFrame that I converted to a Dask DataFrame (conversion sketched below).

df.shape = (60893, 2)

df2.shape = (7254909, 2)
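The conversion was along these lines; the placeholder data and the npartitions value are just for illustration:

import dask.dataframe as dd
import pandas as pd

# Hypothetical stand-in for the original in-memory frame.
pandas_df = pd.DataFrame({'Name': ['Acme Corp.', 'Globex!'], 'Id': [1, 2]})

# npartitions is an arbitrary choice here; df2 was converted the same way.
df = dd.from_pandas(pandas_df, npartitions=4)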

df['name_clean'] = df['Name'].apply(lambda x: re.sub(r'\W+', '', x).lower(), meta=('x', 'str'))
names = df['name_clean'].drop_duplicates().values.compute()

df2['found'] = df2['name_clean2'].apply(lambda x: any(name in x for name in names), meta=('x', 'str')) ~ takes 834 ms

df2.head(10) ~ takes 3 min 54 sec

How can I see the shape of a Dask DataFrame?

Why does .head() take so much time? Am I doing this the right way?

Upvotes: 0

Views: 360

Answers (1)

MRocklin

Reputation: 57251

You cannot iterate over a dask.dataframe or dask.array directly. You need to call the .compute() method first to turn it into a pandas DataFrame/Series or a NumPy array.

Note that just calling the .compute() method and then discarding the result doesn't accomplish anything. You need to save the result to a variable:

import re

# Build the cleaned-name column lazily, then materialize it as a pandas Series.
dask_series = df.Name.apply(lambda x: re.sub(r'\W+', '', x).lower(),
                            meta=('x', 'str'))
pandas_series = dask_series.compute()

# pandas_series is an ordinary in-memory pandas Series, so iteration works.
for name in pandas_series:
    ...
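Applied to the question's setup, the compute-once-then-reuse pattern looks roughly like this; the tiny frames and the ('found', 'bool') meta are illustrative assumptions, not your actual data:

import re

import dask.dataframe as dd
import pandas as pd

# Small stand-ins for df and df2 from the question.
df = dd.from_pandas(pd.DataFrame({'Name': ['Acme Corp.', 'Globex!']}),
                    npartitions=1)
df2 = dd.from_pandas(pd.DataFrame({'name_clean2': ['xx acmecorp yy', 'nothing here']}),
                     npartitions=2)

# Clean the names lazily, then compute once and keep the result in memory.
df['name_clean'] = df['Name'].apply(
    lambda x: re.sub(r'\W+', '', x).lower(), meta=('name_clean', 'str'))
names = df['name_clean'].drop_duplicates().values.compute()  # NumPy array

# names is plain in-memory data, so the row-wise membership check iterates
# over it cheaply; df2 itself stays lazy until .compute() is called.
df2['found'] = df2['name_clean2'].apply(
    lambda x: any(name in x for name in names),
    meta=('found', 'bool'))
print(df2.compute())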

Upvotes: 3
