Reputation: 13
I am testing read speads on parquet files using Dask and python and I'm finding that reading the same file with pandas is significantly faster than Dask. I am looking to understand why this is and if there is a way to get equal performance,
versions all the relevant packages
print(dask.__version__)
print(pd.__version__)
print(pyarrow.__version__)
print(fastparquet.__version__)
2.6.0
0.25.2
0.15.1
0.3.2
import pandas as pd
import numpy as np
import dask.dataframe as dd
col = [str(i) for i in list(np.arange(40))]
df = pd.DataFrame(np.random.randint(0,100,size=(5000000, 4 * 10)), columns=col)
df.to_parquet('large1.parquet', engine='pyarrow')
# Wall time: 3.86 s
df.to_parquet('large2.parquet', engine='fastparquet')
# Wall time: 27.1 s
df = dd.read_parquet('large2.parquet', engine='fastparquet').compute()
# Wall time: 5.89 s
df = dd.read_parquet('large1.parquet', engine='pyarrow').compute()
# Wall time: 4.84 s
df = pd.read_parquet('large1.parquet',engine='pyarrow')
# Wall time: 503 ms
df = pd.read_parquet('large2.parquet',engine='fastparquet')
# Wall time: 4.12 s
When working with a mixed datatypes dataframe the discrepancy is larger.
dtypes: category(7), datetime64[ns](2), float64(1), int64(1), object(9)
memory usage: 973.2+ MB
# df.shape == (8575745, 20)
df.to_parquet('large1.parquet', engine='pyarrow')
# Wall time: 9.67 s
df.to_parquet('large2.parquet', engine='fastparquet')
# Wall time: 33.3 s
# read with Dask
df = dd.read_parquet('large1.parquet', engine='pyarrow').compute()
# Wall time: 34.5 s
df = dd.read_parquet('large2.parquet', engine='fastparquet').compute()
# Wall time: 1min 22s
# read with pandas
df = pd.read_parquet('large1.parquet',engine='pyarrow')
# Wall time: 8.67 s
df = pd.read_parquet('large2.parquet',engine='fastparquet')
# Wall time: 21.8 s
Upvotes: 1
Views: 2525
Reputation: 57251
My first guess is that Pandas saves Parquet datasets into a single row group, which won't allow a system like Dask to parallelize. That doesn't explain why it's slower, but it does explain why it isn't faster.
For further information I would recommend profiling. You may be interested in this document:
Upvotes: 1