Reputation: 983
I would like to use dask's repartition feature, but the requested partition size is not honored, and smaller files are produced.
import pandas as pd
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
import dask.dataframe as dd
file = 'example.parquet'
file_res_dd = 'example_res'
# Generate a random df and write it down as an input data file.
df = pd.DataFrame(np.random.randint(100,size=(100000, 20)),columns=['A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T'])
table = pa.Table.from_pandas(df)
pq.write_table(table, file, version='2.0')
# Read back with dask, repartition, and write it down.
dd_df = dd.read_parquet(file, engine='pyarrow')
dd_df = dd_df.repartition(partition_size='1MB')
dd_df.to_parquet(file_res_dd, engine='pyarrow')
With this example, I am expecting files of about 1 MB each.
The input file written in the first step is about 1.7 MB, so I am expecting 2 files at most.
But in the example_res
folder that is created, I get 9 files of ~270 kB each.
Why is that so?
Thanks for your help!
Upvotes: 0
Views: 266
Reputation: 28683
The "partition size" is of the in-memory representation, and only an approximation.
Parquet offers various encoding and compression options that generally result in a file that is a good deal smaller - but how much smaller will depend greatly on the specific data in question.
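As a rough check (a minimal sketch reusing the example_res path from the question; the exact numbers will vary with your platform and dask/pyarrow versions), you can compare the in-memory size of each partition with the size of the files on disk. The in-memory figures should be close to the requested 1MB, while the parquet files come out much smaller:
import os
import dask.dataframe as dd

# Read the repartitioned output back.
ddf = dd.read_parquet('example_res', engine='pyarrow')

# In-memory size of each partition, in bytes. This is the quantity
# partition_size targets, so the values should be roughly 1MB.
mem_per_partition = ddf.map_partitions(
    lambda part: part.memory_usage(deep=True).sum()
).compute()
print(mem_per_partition)

# On-disk size of each parquet file, in bytes. These are smaller because
# parquet applies dictionary/RLE encoding and (typically) snappy compression.
for name in sorted(os.listdir('example_res')):
    if name.endswith('.parquet'):
        print(name, os.path.getsize(os.path.join('example_res', name)))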
Upvotes: 1