Reputation: 571
{"split1": 0.33, "split2": 0.274, "split3": 0.396}
.
The dictionary values represent the share of the total data among the partitions. The result would be a dictionary of `polars.LazyFrame`s, assuming it is possible to stream the results of the partitioning into `.parquet` files with a `.collect()` at the end.
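In other words, the interface I am after would look roughly like this (the function name is hypothetical, just to show the shape of the result):

```python
import polars as pl

def partition(lf: pl.LazyFrame, fractions: dict[str, float]) -> dict[str, pl.LazyFrame]:
    """Return one LazyFrame per key, sized according to the given fractions."""
    ...
```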
I know I can do a single pass over the data to count the rows, then another pass with `with_row_count()` for each of the partitions, selecting the appropriate ranges.
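For reference, this is roughly how I picture that two-pass version (a sketch only; `data.parquet` is a hypothetical source, and I assume `pl.len()` and `with_row_count()` behave as in recent Polars releases):

```python
import polars as pl

fractions = {"split1": 0.33, "split2": 0.274, "split3": 0.396}
lf = pl.scan_parquet("data.parquet")  # hypothetical source

# Pass 1: count the rows.
n_rows = lf.select(pl.len()).collect().item()

# Pass 2: attach a row index and cut contiguous ranges per partition.
lf_indexed = lf.with_row_count("row_nr")

splits: dict[str, pl.LazyFrame] = {}
start = 0
for name, frac in fractions.items():
    length = round(n_rows * frac)
    splits[name] = lf_indexed.filter(
        pl.col("row_nr").is_between(start, start + length, closed="left")
    ).drop("row_nr")
    start += length
```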
I believe it should be possible to do this "on the fly" in a single pass, iterating over the data and throwing rows into separate baskets at a set frequency so that the shares match the fractions.
What is the most efficient way to do this partitioning? I was not able to find a built-in function in Polars, nor was I able to come up with an efficient way to do it myself.
I am thinking of some mathematical expression based on modular arithmetic or the remainder of a division, but I can't seem to figure out the details.
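To make the modular-arithmetic idea concrete, this is the kind of expression I have in mind (a sketch only, assuming a repeating cycle of 1000 row positions is enough resolution for these fractions; I don't know whether Polars can evaluate the three filters in one streaming pass, or whether it still has to re-scan the source once per partition):

```python
import polars as pl

fractions = {"split1": 0.33, "split2": 0.274, "split3": 0.396}
cycle = 1000  # resolution: the fractions have three decimal places

lf = pl.scan_parquet("data.parquet")  # hypothetical source

# Position of each row within a repeating cycle; in the long run the share of
# positions falling into each sub-range matches that partition's fraction.
lf_pos = lf.with_row_count("row_nr").with_columns(
    (pl.col("row_nr") % cycle).alias("pos")
)

splits: dict[str, pl.LazyFrame] = {}
lower = 0
for name, frac in fractions.items():
    upper = lower + round(frac * cycle)
    splits[name] = lf_pos.filter(
        pl.col("pos").is_between(lower, upper, closed="left")
    ).drop(["row_nr", "pos"])
    lower = upper

# Each partition could then be streamed straight to disk, e.g.
# splits["split1"].sink_parquet("split1.parquet")
```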
Upvotes: 0
Views: 129