Reputation: 393
A simple row-wise shuffle in Polars with
df = df.sample(fraction=1.0, shuffle=True)
has a peak memory usage of 2x the size of the DataFrame (profiled with mprof).
Is there any fast way to perform a row-wise shuffle in Polars while keeping peak memory usage as low as possible? Shuffling column by column (or a batch of columns at a time) with the same seed, or using .take
with a shuffled index, does the trick but is quite slow.
Upvotes: 1
Views: 2610
Reputation: 14680
A shuffle is not in-place. Polars memory is often shared between columns/series/Arrow, so a shuffle has to allocate new memory buffers. If we shuffle the whole DataFrame
in parallel (which sample
does), we allocate the new buffers in parallel and write the shuffled data into them, hence the 2x peak memory usage.
Upvotes: 3