Stephen Nicholson

Reputation: 61

Writing DataFrames as partitioned parquet object in Polars with PyArrow

I have 12 parquet files in a directory with matching columns that I am trying to write to a partitioned object with Polars and PyArrow. I iterate through each file in the directory and read it in as a LazyFrame. I then iterate through the list of DataFrames and write each one to the partitioned object. Each DataFrame has an estimated size of ~1 GB, and all of the DataFrames concatenated are ~10 GB. The process uses ~15 GB of RAM and completes in under an hour.

I tried to do this with the following code:

import glob

import polars as pl

# Scan each input file lazily; nothing is read into memory yet.
all_lazyframes: list[pl.LazyFrame] = []

for file in glob.glob(input_path):
    lazyframe: pl.LazyFrame = pl.scan_parquet(file)
    all_lazyframes.append(lazyframe)

# Materialize all LazyFrames into DataFrames in one pass.
dataframes: list[pl.DataFrame] = pl.collect_all(all_lazyframes)

# Write each DataFrame into the same partitioned dataset via PyArrow.
for output in dataframes:
    output.write_parquet(
        output_path,
        use_pyarrow=True,
        pyarrow_options={"partition_cols": ["part"]},
    )

The resulting partitioned object has the following structure:

partitioned_object/
  part=a/
    data0.parquet
    data1.parquet
    ...
  part=b/
    data0.parquet
    data1.parquet
    ...

This object is ~250 GB in size. My question is: why is the partitioned object so large when the input data is only ~10 GB in total? Is there a more efficient way of achieving this?

Upvotes: 2

Views: 5126

Answers (1)

Stephen Nicholson

Reputation: 61

I solved this by specifying the number of rows per row group in the ds.write_dataset() function. The Polars method currently takes longer, and you cannot specify the number of rows per row group while using the PyArrow option.

import pyarrow as pa
import pyarrow.dataset as ds

# "data" holds the rows to write (e.g. a PyArrow Table).
ds.write_dataset(
    data,
    output_path,
    format="parquet",
    min_rows_per_group=1000000,
    max_rows_per_group=1000000,
    partitioning=ds.partitioning(pa.schema([("type", pa.string())])),
    existing_data_behavior="overwrite_or_ignore",
)
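For context, here is a minimal sketch of how data could be assembled from the LazyFrames in the question before the ds.write_dataset() call; the pl.concat() and to_arrow() steps are my assumption, not part of the original answer:

import glob

import polars as pl

# Scan all input files lazily, concatenate them, collect once,
# and convert the result to a single PyArrow table.
lazyframes = [pl.scan_parquet(f) for f in glob.glob(input_path)]
data = pl.concat(lazyframes).collect().to_arrow()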

The Polars method takes longer, and the row_group_size option does not take effect when using the PyArrow options:

output.write_parquet(
    file=output_path,
    use_pyarrow=True,
    pyarrow_options={"partition_cols": partition_cols},
    # Has no effect when use_pyarrow=True.
    row_group_size=1000000,
)
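As a quick sanity check (my addition, not part of the original answer), the row-group sizes of a written file can be inspected with PyArrow's Parquet metadata; the file path below is hypothetical:

import pyarrow.parquet as pq

# Inspect how many row groups a file has and how many rows each holds.
meta = pq.ParquetFile("partitioned_object/part=a/data0.parquet").metadata
print(meta.num_row_groups)
for i in range(meta.num_row_groups):
    print(meta.row_group(i).num_rows)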

Upvotes: 3
