pierre_j

Reputation: 983

Repartition by size with dask not producing files with expected size

I would like to benefit from dask's repartition feature, but the requested partition size is not honoured: the files produced are much smaller than requested.

import pandas as pd
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
import dask.dataframe as dd

file = 'example.parquet'
file_res_dd = 'example_res'

# Generate a random df and write it down as an input data file.
df = pd.DataFrame(np.random.randint(100, size=(100000, 20)),
                  columns=['A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T'])
table = pa.Table.from_pandas(df)
pq.write_table(table, file, version='2.0')

# Read back with dask, repartition, and write it down.
dd_df = dd.read_parquet(file, engine='pyarrow')
dd_df = dd_df.repartition(partition_size='1MB')
dd_df.to_parquet(file_res_dd, engine='pyarrow')

With this example, I am expecting output files of about 1 MB each. The input file written in the first step is about 1.7 MB, so I am expecting 2 files at most. But in the example_res folder that is created, I get 9 files of roughly 270 kB each.

Why is that so?

Thanks for your help! Best,

Upvotes: 0

Views: 266

Answers (1)

mdurant

Reputation: 28683

The "partition size" is of the in-memory representation, and only an approximation.

Parquet offers various encoding and compression options that generally result in a file a good deal smaller than the in-memory data, but how much smaller will depend greatly on the specific data in question.
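To see the difference concretely, here is a minimal sketch (reusing the random data and the example_res directory name from the question, and assuming a recent dask/pyarrow): it prints the in-memory bytes of each partition, which is the figure partition_size targets, and then the on-disk size of each parquet file written.

import os

import dask.dataframe as dd
import numpy as np
import pandas as pd

# Same random data as in the question.
df = pd.DataFrame(np.random.randint(100, size=(100000, 20)),
                  columns=list('ABCDEFGHIJKLMNOPQRST'))
dd_df = dd.from_pandas(df, npartitions=1).repartition(partition_size='1MB')

# In-memory bytes per partition: this is what partition_size targets.
print(dd_df.map_partitions(lambda part: part.memory_usage(deep=True).sum()).compute())

# On-disk bytes per output file, after parquet encoding and compression.
dd_df.to_parquet('example_res', engine='pyarrow')
for name in sorted(os.listdir('example_res')):
    print(name, os.path.getsize(os.path.join('example_res', name)))

If your dask version provides it, dd_df.memory_usage_per_partition(deep=True) gives the same per-partition in-memory figure directly.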

Upvotes: 1
