Reputation: 153
I have run the same dataset in dask in different ways, and I found that one way is almost 10 times faster than the other! I have tried to find the reason, without success.
import dask.dataframe as dd
from multiprocessing import cpu_count
#Count the number of cores
cores = cpu_count()
#Read the DataFrames and repartition them by the number of cores
english = dd.read_csv('/home/alberto/Escritorio/pycharm/NLP/ignore_files/es-en/europarl-v7.es-en.en',
sep='\r', header=None, names=['ingles'], dtype={'ingles':str})
english = english.repartition(npartitions=cores)
spanish = dd.read_csv('/home/alberto/Escritorio/pycharm/NLP/ignore_files/es-en/europarl-v7.es-en.es',
sep='\r', header=None, names=['espanol'], dtype={'espanol':str})
spanish = spanish.repartition(npartitions=cores)
#compute
%time total_dd = dd.merge(english, spanish, left_index=True, right_index=True).compute()
Out: 9.77 s
import pandas as pd
import dask.dataframe as dd
from multiprocessing import cpu_count
#Count the number of cores
cores = cpu_count()
#Read the DataFrames and partition them by the number of cores
pd_english = pd.read_csv('/home/alberto/Escritorio/pycharm/NLP/ignore_files/es-en/europarl-v7.es-en.en',
sep='\r', header=None, names=['ingles'])
pd_spanish = pd.read_csv('/home/alberto/Escritorio/pycharm/NLP/ignore_files/es-en/europarl-v7.es-en.es',
sep='\r', header=None, names=['espanol'])
english_pd = dd.from_pandas(pd_english, npartitions=cores)
spanish_pd = dd.from_pandas(pd_spanish, npartitions=cores)
#compute
%time total_pd = dd.merge(english_pd, spanish_pd, left_index=True, right_index=True).compute()
Out: 1.31 s
Does anyone know why? Is there another way to do it even faster?
Upvotes: 1
Views: 563
Reputation: 30971
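Note that:
- dd.read_csv is lazy: it only builds a task graph, and the files are actually read when compute() is called,
- pd.read_csv reads the whole file immediately.
So in the first variant the timed operation includes:
- reading both CSV files,
- repartitioning,
- the merge itself.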
In the second variant, as far as what is timed, the situation is different. Both DataFrames have already been read and split into partitions beforehand, so the timed operation covers only the merge of data that is already in memory.
Apparently the source files are big and reading them takes considerable time, which is not accounted for in the second variant.
Try another test. Create a function which:
- reads both DataFrames with pandas,
- converts them to dask DataFrames,
- performs the merge.
Then measure the execution time of this function.
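I suppose the execution time may be even longer than in the first variant, because:
- pandas reads each file in a single thread, whereas dask can read blocks of a file in parallel,
- the conversion with from_pandas is an extra step that the first variant does not need.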
Upvotes: 2