Reputation: 21
I would like to shuffle some column values but only within a certain group and only a certain percentage of rows within the group. For example, per group, I want to shuffle n% of values in column b with each other.
df = pd.DataFrame({'grouper_col':[1,1,2,3,3,3,3,4,4], 'b':[12, 13, 16, 21, 14, 11, 12, 13, 15]})
   grouper_col   b
0            1  12
1            1  13
2            2  16
3            3  21
4            3  14
5            3  11
6            3  12
7            4  13
8            4  15
Example output:
   grouper_col   b
0            1  13
1            1  12
2            2  16
3            3  21
4            3  11
5            3  14
6            3  12
7            4  15
8            4  13
I found
df.groupby("grouper_col")["b"].transform(np.random.permutation)
but then I have no control over the percentage of shuffled values.
Thank you for any hints!
Upvotes: 1
Views: 531
Reputation: 1548
You can use numpy to create a function like this (it takes a numpy array as input):
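import numpy as np

def shuffle_portion(arr, percentage):
    # Pick round(n * percentage / 100) distinct positions, in random order
    shuf = np.random.choice(np.arange(arr.shape[0]),
                            round(arr.shape[0]*percentage/100),
                            replace=False)
    # Write the picked values back into the same positions, taken in sorted order
    arr[np.sort(shuf)] = arr[shuf]
    return arr  # note: arr is modified in place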
np.random.choice picks a set of indexes of the size you need. The values at those positions are then rearranged among themselves in the shuffled order, and every other position keeps its value. This should shuffle 3 of the 9 values in column 'b':
df['b'] = shuffle_portion(df['b'].values, 33)
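To make the index trick easier to inspect, here is a minimal sketch of the same idea that works on a copy and also returns the chosen positions. The name shuffle_portion_demo, the seed argument and the use of np.random.default_rng are my own assumptions for illustration, not part of the function above.

```python
import numpy as np

def shuffle_portion_demo(arr, percentage, seed=None):
    """Return (shuffled copy, sorted chosen positions) for a 1-D array."""
    rng = np.random.default_rng(seed)          # seeded generator for repeatable runs
    out = np.asarray(arr).copy()               # leave the caller's array untouched
    k = round(out.shape[0] * percentage / 100) # number of positions to permute
    picked = rng.choice(out.shape[0], size=k, replace=False)  # k distinct positions, random order
    out[np.sort(picked)] = out[picked]         # permute values among the picked positions only
    return out, np.sort(picked)

b = np.array([12, 13, 16, 21, 14, 11, 12, 13, 15])
shuffled, positions = shuffle_portion_demo(b, 33, seed=0)

# Positions that were not picked must still hold their original values
untouched = np.setdiff1d(np.arange(b.size), positions)
```

Note that a picked value can land back on its own position, so "3 values shuffled" means that 3 positions took part in the permutation, not that 3 values are guaranteed to move.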
EDIT:
To use it with apply, you need to convert the passed dataframe to an array inside the function (explained in the comments) and write the result back into the dataframe that you return:
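def shuffle_portion(_df, percentage=50):
    # Work on a copy: .values can be a view (or read-only) of the group's data
    arr = _df['b'].values.copy()
    # Pick round(n * percentage / 100) distinct positions within this group
    shuf = np.random.choice(np.arange(arr.shape[0]),
                            round(arr.shape[0]*percentage/100),
                            replace=False)
    # Permute the picked values among the picked positions
    arr[np.sort(shuf)] = arr[shuf]
    _df['b'] = arr
    return _df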
Now you can just do
df.groupby("grouper_col", as_index=False).apply(shuffle_portion)
It would be better practice to pass the name of the column you need to shuffle to the function (def shuffle_portion(_df, col='b', percentage=50): arr = _df[col].values ...)
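As an alternative sketch (not the apply-based approach above), the same partial shuffle can be plugged into the transform call from the question, which keeps the original index and leaves the grouping column alone. The helper name shuffle_fraction and its rng parameter are hypothetical names introduced here. I am assuming that transform forwards extra keyword arguments to the function and accepts an array of the group's length as the return value.

```python
import numpy as np
import pandas as pd

def shuffle_fraction(values, percentage=50, rng=None):
    """Permute about `percentage`% of the entries of one group's Series."""
    rng = np.random.default_rng() if rng is None else rng
    out = values.to_numpy(copy=True)           # writable copy of the group's values
    k = round(len(out) * percentage / 100)     # positions to permute in this group
    picked = rng.choice(len(out), size=k, replace=False)
    out[np.sort(picked)] = out[picked]         # shuffle only among the picked positions
    return out

df = pd.DataFrame({'grouper_col': [1, 1, 2, 3, 3, 3, 3, 4, 4],
                   'b': [12, 13, 16, 21, 14, 11, 12, 13, 15]})

rng = np.random.default_rng(42)
shuffled = df.copy()
# Extra keyword arguments are passed through to shuffle_fraction for each group
shuffled['b'] = df.groupby('grouper_col')['b'].transform(
    shuffle_fraction, percentage=50, rng=rng)
```

Two things to keep in mind with small groups: Python's round rounds halves to the nearest even number (round(0.5) is 0), and a group where only one position is picked cannot change. With this data and 50%, only the group with four rows can actually end up reordered.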
Upvotes: 1