Pandas dataframe randomly shuffle some column values in groups

Question

I would like to shuffle some column values but only within a certain group and only a certain percentage of rows within the group. For example, per group, I want to shuffle n% of values in column b with each other.

df = pd.DataFrame({'grouper_col':[1,1,2,3,3,3,3,4,4], 'b':[12, 13, 16, 21, 14, 11, 12, 13, 15]})

   grouper_col   b
0            1  12
1            1  13
2            2  16
3            3  21
4            3  14
5            3  11
6            3  12
7            4  13
8            4  15

Example output:

   grouper_col   b
0            1  13
1            1  12
2            2  16
3            3  21
4            3  11
5            3  14
6            3  12
7            4  15
8            4  13

I found

df.groupby("grouper_col")["b"].transform(np.random.permutation)

but then I have no control over the percentage of shuffled values.

Thank you for any hints!

doca · Accepted Answer

You can write a helper with NumPy like this. It takes a NumPy array as input and modifies it in place:

import numpy as np

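def shuffle_portion(arr, percentage):
    # Pick round(n * percentage / 100) distinct positions at random.
    shuf = np.random.choice(np.arange(arr.shape[0]),
                            round(arr.shape[0]*percentage/100),
                            replace=False)
    # Write the picked values back into the same positions taken in sorted
    # order, which permutes them among themselves. arr is modified in place.
    arr[np.sort(shuf)] = arr[shuf]
    return arr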

np.random.choice picks the required number of distinct indexes. The values at those positions are then rearranged among themselves in the shuffled order. For example, this shuffles 3 of the 9 values in column 'b' (round(9*33/100) = 3), ignoring the groups:

df['b'] = shuffle_portion(df['b'].to_numpy().copy(), 33)  # copy so the column's own buffer is not written to
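As a quick sanity check of the idea above, here is a minimal sketch that is not part of the original answer. It works on a copy instead of modifying the input, and it takes an optional random generator so the result can be reproduced. The name shuffle_portion_copy and the rng parameter are made up for illustration.

```python
import numpy as np

def shuffle_portion_copy(arr, percentage, rng=None):
    """Return a copy of arr with about `percentage` % of its positions
    permuted among themselves (hypothetical helper, for illustration)."""
    rng = np.random.default_rng() if rng is None else rng
    out = np.array(arr, copy=True)           # never touch the caller's array
    k = round(out.shape[0] * percentage / 100)
    picked = rng.choice(out.shape[0], size=k, replace=False)
    # Values at the picked positions are written back in a different order.
    out[np.sort(picked)] = out[picked]
    return out

original = np.array([12, 13, 16, 21, 14, 11, 12, 13, 15])
shuffled = shuffle_portion_copy(original, 33, rng=np.random.default_rng(0))

# Number of positions whose value differs after the partial shuffle.
changed = int((original != shuffled).sum())
print(shuffled, changed)
```

Fewer than 3 values may actually change, because a picked value can land back in its own position or trade places with an equal value.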

EDIT: To use this with groupby(...).apply, the function has to accept the group's DataFrame, extract the column as an array inside the function (explained in the comments), and return the modified DataFrame:

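def shuffle_portion(_df, percentage=50):
    _df = _df.copy()  # work on a copy so the group passed in is not modified
    arr = _df['b'].to_numpy().copy()
    # Pick round(n * percentage / 100) distinct positions within this group.
    shuf = np.random.choice(np.arange(arr.shape[0]),
                            round(arr.shape[0]*percentage/100),
                            replace=False)
    # Permute the picked values among themselves.
    arr[np.sort(shuf)] = arr[shuf]
    _df['b'] = arr
    return _df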

Now you can just do

df.groupby("grouper_col", as_index=False).apply(shuffle_portion)

It would be better practice to pass the name of the column to shuffle as a parameter to the function (def shuffle_portion(_df, col='b', percentage=50): arr = _df[col].values ...)
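An alternative that stays close to the transform call from the question is sketched below. It is an assumption-laden variant, not the answer's method: the helper name shuffle_group_portion and its rng parameter are hypothetical, and the helper receives one group's values at a time and returns a shuffled copy of the same length.

```python
import numpy as np
import pandas as pd

def shuffle_group_portion(values, percentage=50, rng=None):
    """Permute about `percentage` % of one group's values among themselves
    (hypothetical helper, for illustration)."""
    rng = np.random.default_rng() if rng is None else rng
    out = np.asarray(values).copy()           # copy, so the input stays intact
    k = round(out.shape[0] * percentage / 100)
    picked = rng.choice(out.shape[0], size=k, replace=False)
    out[np.sort(picked)] = out[picked]
    return out

df = pd.DataFrame({'grouper_col': [1, 1, 2, 3, 3, 3, 3, 4, 4],
                   'b': [12, 13, 16, 21, 14, 11, 12, 13, 15]})

rng = np.random.default_rng(42)               # seeded only for reproducibility
result = df.copy()
# transform calls the helper once per group and keeps the original row order.
result['b'] = df.groupby('grouper_col')['b'].transform(
    lambda s: shuffle_group_portion(s, percentage=50, rng=rng))
print(result)
```

With percentage=50, a group of two rows gives round(1.0) = 1 picked position, and a single picked position cannot move, so such small groups stay unchanged unless the percentage is higher.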
