Reputation: 636
I am running the same functionality with the Pandas API and the Dask API. I expected the Dask version to be faster, but it is not.
Functionality
I cross join 2 dataframes (pandas and dask respectively) on a 'grouping' column and then, for every resulting pair, I compute the Levenshtein distance between 2 strings.
The results are as expected, but I am concerned about the performance.
Pandas
@timeit
def pd_fuzzy_comparison(df1: DF, df2: DF, group_col: str, compare):
    df = df1.merge(df2, on=group_col, suffixes=('_internal', '_external'))
    df['score'] = df.apply(
        lambda d: compare(d.company_norm_internal, d.company_norm_external), axis=1)
    return df
Dask
@timeit
def dd_fuzzy_comparison(dd1: dd.DataFrame, dd2: dd.DataFrame, group_col: str, compare):
    ddf = dd1.merge(dd2, on=group_col, suffixes=('_internal', '_external'))
    # meta tells Dask the name and dtype of the new column up front
    ddf['score'] = ddf.apply(
        lambda d: compare(d.company_norm_internal, d.company_norm_external),
        axis=1, meta=('score', 'float64'))
    return ddf.compute()
Main
import multiprocessing
import dask.dataframe as dd

CORES = multiprocessing.cpu_count()

results = pd_fuzzy_comparison(
    df1=internal_bubbles.copy(),
    df2=external_bubbles.copy(),
    group_col='city_norm',
    compare=ratio)

ddata1 = dd.from_pandas(internal_bubbles.copy(), npartitions=CORES)
ddata2 = dd.from_pandas(external_bubbles.copy(), npartitions=CORES)
ddresults = dd_fuzzy_comparison(
    dd1=ddata1.copy(), dd2=ddata2.copy(),
    group_col='city_norm',
    compare=ratio)
Output
'pd_fuzzy_comparison' 1122.39 ms
'dd_fuzzy_comparison' 1717.83 ms
What am I missing?
Thanks!
Upvotes: 0
Views: 1317
Reputation: 57319
First, Dask isn't always faster than Pandas. If Pandas works for you, you should stick with it.
https://docs.dask.org/en/latest/best-practices.html#start-small
In your particular case you're using the df.apply method, which runs a Python for loop under the hood and will be slow in any case. It is also bound by the GIL, so you will want to choose a scheduler that uses processes, such as the dask.distributed or multiprocessing schedulers.
https://docs.dask.org/en/latest/scheduling.html
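For example, here is a minimal sketch of both options, assuming ddf is the merged Dask dataframe built inside dd_fuzzy_comparison:

# Option 1: run this one computation on the multiprocessing scheduler.
# Process-based schedulers sidestep the GIL for pure-Python work such as
# a row-wise .apply, at the cost of shipping data between processes.
result = ddf.compute(scheduler='processes')

# Option 2: start a local dask.distributed cluster of worker processes;
# every subsequent .compute() call then runs on that cluster.
from dask.distributed import Client
client = Client()
result = ddf.compute()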
Upvotes: 1