Reputation: 17674
I want to shuffle the values of a pandas DataFrame across its columns. However, the default method (sample) applies the same column permutation to every row.
How can I efficiently shuffle the values of each row independently?
import pandas as pd
df = pd.DataFrame({'foo':[1,4,7],'bar':[2,5,8],'baz':[3,6,9],})
display(df)
df.sample(frac=1, axis=1)
Certainly, an apply-based solution would work, but it would not be vectorized and would therefore be slow.
Is there a fast (and ideally vectorized) way to sample differently for each row?
Upvotes: 1
Views: 130
Reputation: 71689
Let us try np.random.rand and argsort to generate shuffled indices:
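import numpy as np
# Argsort of random keys yields an independent permutation for each row
i = np.random.rand(*df.shape).argsort(axis=1)
# Assign through iloc; writing into df.values may modify a copy rather than df
df.iloc[:, :] = np.take_along_axis(df.to_numpy(), i, axis=1)
print(df)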
foo bar baz
0 3 1 2
1 4 5 6
2 7 9 8
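The same argsort idea can be wrapped in a function that leaves the input untouched. This is only a minimal sketch: the name shuffle_within_rows and the seed parameter are hypothetical, added here for illustration and reproducibility, and it uses NumPy's Generator API instead of np.random.rand.

```python
import numpy as np
import pandas as pd


def shuffle_within_rows(df, seed=None):
    """Return a new DataFrame whose values are shuffled independently within each row."""
    rng = np.random.default_rng(seed)
    values = df.to_numpy()
    # Sorting a matrix of random keys row by row gives one random permutation per row.
    order = rng.random(values.shape).argsort(axis=1)
    shuffled = np.take_along_axis(values, order, axis=1)
    # Keep the original index and column labels; only the cell values move.
    return pd.DataFrame(shuffled, index=df.index, columns=df.columns)


df = pd.DataFrame({'foo': [1, 4, 7], 'bar': [2, 5, 8], 'baz': [3, 6, 9]})
result = shuffle_within_rows(df, seed=0)
print(result)
```

Because the result is a new DataFrame, df itself keeps its original order, which may be preferable when the unshuffled data is still needed.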
Upvotes: 2
Reputation: 3633
You can try this solution:
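import numpy as np

def shuffle_columns_per_row(df):
    arr = df.values
    x, y = arr.shape
    # Row number of every cell, paired with one random column order per row
    rows = np.indices((x, y))[0]
    cols = [np.random.permutation(y) for _ in range(x)]
    # Pass index=df.index so the original row labels are preserved
    return pd.DataFrame(arr[rows, cols], index=df.index, columns=df.columns)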
| foo | bar | baz |
|-----|-----|-----|
| 3 | 2 | 1 |
| 5 | 6 | 4 |
| 9 | 7 | 8 |
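If a recent NumPy is available (1.20 or later, as far as I know), the Python-level loop over rows can probably be avoided with Generator.permuted, which shuffles each slice along an axis independently. A rough sketch, where shuffle_rows_permuted and seed are hypothetical names used only for this example:

```python
import numpy as np
import pandas as pd


def shuffle_rows_permuted(df, seed=None):
    """Shuffle the values of each row independently using Generator.permuted."""
    rng = np.random.default_rng(seed)
    # permuted(..., axis=1) returns a new array in which every row
    # has been shuffled on its own; the input array is not modified.
    shuffled = rng.permuted(df.to_numpy(), axis=1)
    return pd.DataFrame(shuffled, index=df.index, columns=df.columns)


df = pd.DataFrame({'foo': [1, 4, 7], 'bar': [2, 5, 8], 'baz': [3, 6, 9]})
result = shuffle_rows_permuted(df, seed=42)
print(result)
```

This swaps in a different technique from the fancy-indexing approach above, but the outcome should be the same kind of per-row shuffle.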
Upvotes: 1
Reputation: 1644
A quick check gives the following baseline for sample:
%%timeit
df.sample(frac=1, axis=1)
# 1000 loops, best of 5: 288 µs per loop
With apply, as you said, we get:
%%timeit
idx = np.random.choice([0, 1, 2], size=(3,), replace=False)
df.apply(lambda x: x.iloc[idx], axis=1)
# 1000 loops, best of 5: 1.47 ms per loop -> ~5 times slower
We could use iloc instead:
%%timeit
idx = np.random.choice([0, 1, 2], size=(3,), replace=False)
df.iloc[:, idx]
# 1000 loops, best of 5: 398 µs per loop -> ~1.4 times slower
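If you can live with a roughly 1.4 times decrease in speed, I think the iloc version would work. Note, however, that idx is drawn only once here, so both the apply and the iloc versions reorder the columns identically for every row, just as sample does; a per-row shuffle needs a separate permutation for each row.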
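Outside a notebook, a comparison like the one above could be reproduced with the standard timeit module. The sketch below is only illustrative: the helper names (with_sample, with_iloc, with_argsort) are made up for this example, and the timings will differ from machine to machine, so no numbers are quoted. Keep in mind that with_sample and with_iloc reorder whole columns, while with_argsort shuffles every row independently.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({'foo': [1, 4, 7], 'bar': [2, 5, 8], 'baz': [3, 6, 9]})
rng = np.random.default_rng()


def with_sample():
    # One permutation of the column labels, shared by all rows
    return df.sample(frac=1, axis=1)


def with_iloc():
    # Same effect as sample: a single column order for the whole frame
    return df.iloc[:, rng.permutation(df.shape[1])]


def with_argsort():
    # Independent permutation for each row via argsort of random keys
    order = rng.random(df.shape).argsort(axis=1)
    return np.take_along_axis(df.to_numpy(), order, axis=1)


# Best of 3 repeats, 200 calls each, reported as seconds per call
timings = {
    func.__name__: min(timeit.repeat(func, number=200, repeat=3)) / 200
    for func in (with_sample, with_iloc, with_argsort)
}
for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e6:.1f} us per call")
```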
Upvotes: 0