Why does Dask compute faster on DataFrames created with from_pandas than on ones read directly with Dask?

Question

I have run the same dataset through Dask in two different ways, and I found that one way is almost 10 times faster than the other. I have tried to find the reason without success.

1. Entirely with dask

import dask.dataframe as dd
from multiprocessing import cpu_count

#Count the number of cores
cores = cpu_count()

#read the dataframes and partition them by the number of cores
english = dd.read_csv('/home/alberto/Escritorio/pycharm/NLP/ignore_files/es-en/europarl-v7.es-en.en',
               sep='\r', header=None, names=['ingles'], dtype={'ingles':str})
english = english.repartition(npartitions=cores)
spanish = dd.read_csv('/home/alberto/Escritorio/pycharm/NLP/ignore_files/es-en/europarl-v7.es-en.es',
              sep='\r', header=None, names=['espanol'], dtype={'espanol':str})
spanish = spanish.repartition(npartitions=cores)

#compute
%time total_dd = dd.merge(english, spanish, left_index=True, right_index=True).compute()

Out: 9.77 s

2. Pandas + Dask

import pandas as pd
import dask.dataframe as dd
from multiprocessing import cpu_count

#Count the number of cores
cores = cpu_count()

#Read the DataFrames and partition them by the number of cores
pd_english = pd.read_csv('/home/alberto/Escritorio/pycharm/NLP/ignore_files/es-en/europarl-v7.es-en.en',
                      sep='\r', header=None, names=['ingles'])

pd_spanish = pd.read_csv('/home/alberto/Escritorio/pycharm/NLP/ignore_files/es-en/europarl-v7.es-en.es',
                      sep='\r', header=None, names=['espanol'])
english_pd = dd.from_pandas(pd_english, npartitions=cores)
spanish_pd = dd.from_pandas(pd_spanish, npartitions=cores)

#compute
%time total_pd = dd.merge(english_pd, spanish_pd, left_index=True, right_index=True).compute()

Out: 1.31 s

Does anyone know why? Is there another way to make it even faster?

Valdi_Bo · Accepted Answer

Note that:

dd.read_csv(...) does not actually load the data. It only adds a step to the computation tree (the task graph).
Only when you run compute is the whole computation tree built so far actually executed, including the reading of both DataFrames.

So in the first variant the timed operation includes:

reading both DataFrames,
repartitioning them,
and finally the merge itself.

In the second variant, what is timed is different. Both DataFrames have already been read by pandas before the timed line, so the timed operation includes only the partitioning and the merge.

Apparently the source DataFrames are big and reading them takes considerable time, not accounted for in the second variant.

Try another test: Create a function which:

reads both DataFrames with pd.read_csv(...),
performs the remaining steps (repartition and merge).

Then compute the execution time of this function.

I suppose the execution time may be even longer than in the first variant, because:

in the first variant both DataFrames are read concurrently (by different cores),
in the test proposed above they are read sequentially.
