Reputation: 61
I have 12 parquet files with matching columns in a directory that I am trying to write to a partitioned object with Polars and PyArrow. I iterate through each file in the directory, reading it in as a LazyFrame, collect all of the LazyFrames into DataFrames, and then iterate through the list of DataFrames, writing each one to the partitioned object. The estimated size of each DataFrame is ~1GB, and all of the DataFrames concatenated are ~10GB. The process uses ~15GB of RAM and completes in under an hour.
I tried to do this with the following code:
import glob
import polars as pl

# Scan each input file lazily
all_lazyframes: list[pl.LazyFrame] = []
for file in glob.glob(input_path):
    lazyframe: pl.LazyFrame = pl.scan_parquet(file)
    all_lazyframes.append(lazyframe)

# Materialize all LazyFrames into DataFrames
dataframes: list[pl.DataFrame] = pl.collect_all(all_lazyframes)

# Write each DataFrame to the hive-partitioned output
for output in dataframes:
    output.write_parquet(
        output_path,
        use_pyarrow=True,
        pyarrow_options={"partition_cols": ["part"]},
    )
The resulting partitioned object has the following structure:
partitioned_object/
    part=a/
        data0.parquet
        data1.parquet
        ...
    part=b/
        data0.parquet
        data1.parquet
        ...
This object is ~250GB in size. My question is: why is the partitioned object so large when the input data is only ~10GB in total? Is there a more efficient way of achieving this?
Upvotes: 2
Views: 5126
Reputation: 61
I solved this by specifying the number of rows per group in PyArrow's ds.write_dataset() function. The Polars method currently takes longer, and you cannot specify the number of rows per group while using the PyArrow option.
import pyarrow as pa
import pyarrow.dataset as ds

ds.write_dataset(
    data,
    output_path,
    format="parquet",
    min_rows_per_group=1000000,
    max_rows_per_group=1000000,
    partitioning=ds.partitioning(pa.schema([("type", pa.string())])),
    existing_data_behavior="overwrite_or_ignore",
)
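
For reference, data can be anything PyArrow accepts as a dataset source. A minimal sketch (assuming the same input_path glob from the question) is to let PyArrow scan the source files lazily rather than collecting them with Polars first:

import glob
import pyarrow.dataset as ds

# Lazily scan the source parquet files; write_dataset() can then stream
# record batches into the partitioned output instead of holding everything in memory.
data = ds.dataset(glob.glob(input_path), format="parquet")

Alternatively, the DataFrames already collected with Polars can be passed in as tables, e.g. pa.concat_tables(df.to_arrow() for df in dataframes).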
The Polars method takes longer, and the row_group_size option does not take effect when using the PyArrow options:
output.write_parquet(
    file=output_path,
    use_pyarrow=True,
    pyarrow_options={"partition_cols": partition_cols},
    row_group_size=1000000,  # ignored when pyarrow_options is used
)
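
If you want to stay with Polars' native writer so that row_group_size is honored, one workaround (a sketch, not the approach above; it assumes the partition column is "part" as in the question) is to split each DataFrame with partition_by() and write each partition directory yourself:

import os

for i, df in enumerate(dataframes):
    for part_df in df.partition_by("part"):
        part_value = part_df["part"][0]
        part_dir = os.path.join(output_path, f"part={part_value}")
        os.makedirs(part_dir, exist_ok=True)
        # data{i}.parquet keeps file names unique across the source frames
        part_df.write_parquet(
            os.path.join(part_dir, f"data{i}.parquet"),
            row_group_size=1000000,
        )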
Upvotes: 3