Danny Friar

Reputation: 393

Memory-efficient row-wise shuffle in Polars

A simple row-wise shuffle in Polars with

df = df.sample(frac=1.0)

has a peak memory usage of 2x the size of the DataFrame (profiled with mprof).

Is there a fast way to perform a row-wise shuffle in Polars while keeping memory usage as low as possible? Shuffling column by column (or a batch of columns at a time) with the same seed, or using .take with a random index, does the trick but is quite slow.

Upvotes: 1

Views: 2610

Answers (1)

ritchie46

Reputation: 14680

A shuffle is not done in place. Polars memory is often shared between columns, Series, and Arrow arrays.

A shuffle therefore has to allocate a new memory buffer. If we shuffle the whole DataFrame in parallel (which sample does), we allocate new buffers for all columns in parallel and write the shuffled data into them, hence the 2x memory usage.

Upvotes: 3
