Pablo Ruiz Ruiz

Reputation: 636

Dask worse performance than Pandas

I am running the same functionality using the Pandas API and the Dask API. I expected the Dask version to be faster, but it is not.

Functionality

I cross join two dataframes (pandas and Dask respectively) on a 'grouping' column and then, for every resulting pair, I compute the Levenshtein distance between the two strings.
The results are as expected, but I am concerned about the performance.

Pandas

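import pandas as pd

@timeit
def pd_fuzzy_comparison(df1: pd.DataFrame, df2: pd.DataFrame, bubble_col: str, compare):
    # Pair up rows that share the same grouping value
    df = df1.merge(df2, on=bubble_col, suffixes=('_internal', '_external'))
    # Score every pair of company names, row by row
    df['score'] = df.apply(
            lambda d: compare(d.company_norm_internal, d.company_norm_external), axis=1)
    return df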

Dask

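import dask.dataframe as dd

@timeit
def dd_fuzzy_comparison(dd1: dd.DataFrame, dd2: dd.DataFrame, bubble_col: str, compare):
    # Pair up rows that share the same grouping value
    ddf = dd1.merge(dd2, on=bubble_col, suffixes=('_internal', '_external'))
    # Score every pair of company names, row by row
    ddf['score'] = ddf.apply(
            lambda d: compare(d.company_norm_internal, d.company_norm_external), axis=1)

    return ddf.compute()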

Main

import multiprocessing
CORES = multiprocessing.cpu_count()

results = pd_fuzzy_comparison(
                df1=internal_bubbles.copy(), 
                df2=external_bubbles.copy(), 
                bubble_col='city_norm',
                compare=ratio ) 

ddata1 = dd.from_pandas(internal_bubbles.copy(), npartitions=CORES)
ddata2 = dd.from_pandas(external_bubbles.copy(), npartitions=CORES)

ddresults = dd_fuzzy_comparison(
                dd1=ddata1.copy(), dd2=ddata2.copy(), 
                bubble_col='city_norm',
                compare=ratio)

Output

'pd_fuzzy_comparison'  1122.39 ms
'dd_fuzzy_comparison'  1717.83 ms

What am I missing?
Thanks!

Upvotes: 0

Views: 1317

Answers (1)

MRocklin

Reputation: 57319

First, Dask isn't always faster than Pandas. If Pandas works for you, you should stick with it.

https://docs.dask.org/en/latest/best-practices.html#start-small

In your particular case you're using the df.apply method, which runs a Python for loop over the rows and will be slowish in any case. It is also GIL-bound, so you will want to choose a scheduler that uses processes, such as the dask.distributed or multiprocessing schedulers.

https://docs.dask.org/en/latest/scheduling.html

Upvotes: 1
