Georg Heiler

Reputation: 17674

vectorized shuffling per row in pandas

I want to shuffle the values within each row of a pandas DataFrame. However, the default method (sample) applies the same column permutation to every row.

How can I efficiently shuffle the columns of each row differently?

import pandas as pd

df = pd.DataFrame({'foo':[1,4,7],'bar':[2,5,8],'baz':[3,6,9],})
display(df)
df.sample(frac=1, axis=1)

An apply-based solution would certainly work, but it would not be vectorized and would therefore be slow.

Is there a fast (and ideally vectorized) way to sample differently for each row?

Upvotes: 1

Views: 130

Answers (3)

Shubham Sharma

Reputation: 71689

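Let us try np.random.rand with argsort to generate an independent set of shuffled indices for each row:

import numpy as np

# Sorting random keys along axis 1 yields one permutation per row
i = np.random.rand(*df.shape).argsort(axis=1)
# Assign through df[:] because df.values may be a copy or read-only
df[:] = np.take_along_axis(df.to_numpy(), i, axis=1)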

print(df)

   foo  bar  baz
0    3    1    2
1    4    5    6
2    7    9    8
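
For a version of the same idea that returns a new frame instead of overwriting df, a minimal sketch could look like the following. The helper name shuffle_rowwise and its seed parameter are hypothetical (they are not part of pandas or NumPy), and the sketch assumes all columns share one dtype.

```python
import numpy as np
import pandas as pd


def shuffle_rowwise(df, seed=None):
    """Return a copy of df with the values of each row shuffled independently.

    Hypothetical helper; assumes all columns share one dtype.
    """
    rng = np.random.default_rng(seed)
    arr = df.to_numpy()
    # Sorting random keys row by row yields one independent permutation per row
    order = rng.random(arr.shape).argsort(axis=1)
    shuffled = np.take_along_axis(arr, order, axis=1)
    return pd.DataFrame(shuffled, index=df.index, columns=df.columns)


df = pd.DataFrame({'foo': [1, 4, 7], 'bar': [2, 5, 8], 'baz': [3, 6, 9]})
out = shuffle_rowwise(df, seed=0)
print(out)
```

Each row of out holds the same values as the matching row of df, only in a different order, and df itself is left unchanged.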

Upvotes: 2

MSS

Reputation: 3633

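You can try this solution (with numpy imported as np):

def shuffle_columns_per_row(df):
    arr = df.to_numpy()
    x, y = arr.shape
    # Row index of every cell, shape (x, y)
    rows = np.indices((x, y))[0]
    # One independent column permutation per row, shape (x, y)
    cols = np.array([np.random.permutation(y) for _ in range(x)])
    # Keep the original index as well as the column labels
    return pd.DataFrame(arr[rows, cols], index=df.index, columns=df.columns)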

| foo | bar | baz |
|-----|-----|-----|
| 3   | 2   | 1   |
| 5   | 6   | 4   |
| 9   | 7   | 8   |

Upvotes: 1

Albo

Reputation: 1644

A quick check gives the benchmark:

%%timeit
df.sample(frac=1, axis=1) 
# 1000 loops, best of 5: 288 µs per loop

With apply, as you said, we get:

%%timeit
idx = np.random.choice([0, 1, 2], size=(3,), replace=False)
df.apply(lambda x: x.iloc[idx], axis=1)
# 1000 loops, best of 5: 1.47 ms per loop -> ~5 times slower

We could rather use iloc:

%%timeit
idx = np.random.choice([0, 1, 2], size=(3,), replace=False)
df.iloc[:, idx]
# 1000 loops, best of 5: 398 µs per loop -> ~1.4 times slower

If you can live with a roughly 1.4 times slowdown, I think the iloc version would work. Note, however, that it reorders whole columns, so every row receives the same permutation.
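
To time approaches that really do shuffle each row independently, a rough sketch with the standard timeit module might look like this. The frame size, the number of repetitions, and the helper names per_row_apply and per_row_argsort are arbitrary assumptions; the absolute numbers will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical test frame: 200 rows, 3 integer columns
df = pd.DataFrame(rng.integers(0, 100, size=(200, 3)),
                  columns=['foo', 'bar', 'baz'])


def per_row_apply(frame):
    # One Python-level call per row, each drawing its own permutation
    return frame.apply(
        lambda row: pd.Series(rng.permutation(row.to_numpy()), index=row.index),
        axis=1,
    )


def per_row_argsort(frame):
    # Single vectorized pass: argsort of random keys gives per-row permutations
    arr = frame.to_numpy()
    order = rng.random(arr.shape).argsort(axis=1)
    return pd.DataFrame(np.take_along_axis(arr, order, axis=1),
                        index=frame.index, columns=frame.columns)


runs = 5
t_apply = timeit.timeit(lambda: per_row_apply(df), number=runs)
t_argsort = timeit.timeit(lambda: per_row_argsort(df), number=runs)
print(f"apply:   {t_apply / runs * 1e3:.2f} ms per call")
print(f"argsort: {t_argsort / runs * 1e3:.2f} ms per call")
```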

Upvotes: 0
