Adam Bajger

Streaming partitioning into partitions of set size in Polars

Situation

  1. I have a large Parquet dataset that does not fit into RAM.
  2. I need to split the dataset into multiple partitions defined by a dictionary such as {"split1": 0.33, "split2": 0.274, "split3": 0.396}, where each value is that partition's share of the total rows. The result would be a dictionary of polars.LazyFrame objects, assuming the partitions can then be streamed into .parquet files.
  3. I need this to be as fast as possible.
  4. I want to use Polars for this, in lazy evaluation mode, with a streaming collect() at the end.

Suboptimal solution

I know I can do a single pass over the data to count the rows, then another pass with with_row_count() for each partition to select the appropriate row ranges.
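
A minimal sketch of that two-pass idea, for concreteness. The file name data.parquet and the splits dictionary are illustrative assumptions, and pl.len() is pl.count() on older Polars versions:

```python
import polars as pl

# Assumed inputs: a large Parquet file and the desired split fractions.
splits = {"split1": 0.33, "split2": 0.274, "split3": 0.396}
lf = pl.scan_parquet("data.parquet")

# Pass 1: count the rows without materializing the data.
n_rows = lf.select(pl.len()).collect().item()

# Pass 2 (one streaming scan per partition): slice each partition's
# contiguous row range and sink it to its own Parquet file.
start = 0
for name, frac in splits.items():
    length = round(n_rows * frac)
    lf.slice(start, length).sink_parquet(f"{name}.parquet")
    start += length
```

Here slice() makes the explicit with_row_count() unnecessary, but the shape of the solution is the same: one counting pass, then one pass per partition.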

Hypothesis

I believe it should be possible to do this "on the fly" in a single pass, iterating over the data and distributing rows into separate baskets at a fixed frequency that matches the fractions.

Actual question

What is the most efficient way to do this partitioning? I was not able to find a built-in function in Polars, nor was I able to come up with an efficient approach myself.

Comments

I am thinking of some mathematical expression based on modular arithmetic (remainders under integer division), but I can't seem to figure out the details.
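
One way that modular idea could look, as a sketch: scale the fractions to integer thresholds over a fixed cycle length, then bucket each row by its row index modulo that cycle. This avoids the preliminary counting pass, although each sink_parquet() call still triggers its own scan of the source, and depending on the Polars version the row-index operation may force a fallback out of the streaming engine. The file name and the cycle length of 1000 are assumptions for illustration; with_row_index() is with_row_count() on older Polars versions:

```python
import polars as pl

splits = {"split1": 0.33, "split2": 0.274, "split3": 0.396}

# Scale the fractions to integer thresholds over one cycle of 1000 rows:
# indices [0, 330) -> split1, [330, 604) -> split2, [604, 1000) -> split3.
cycle = 1000
bounds, acc = [], 0
for name, frac in splits.items():
    acc += round(frac * cycle)
    bounds.append((name, acc))

lf = pl.scan_parquet("data.parquet").with_row_index("idx")  # assumed file
pos = pl.col("idx") % cycle  # position of each row within its cycle

lower = 0
for name, upper in bounds:
    (
        lf.filter((pos >= lower) & (pos < upper))
        .drop("idx")
        .sink_parquet(f"{name}.parquet")
    )
    lower = upper
```

Note that for fractions other than these, the rounded thresholds may not sum exactly to the cycle length, so the last bound may need to be forced to cycle.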
