Reputation: 13
Suppose I have the following dataframe:
Type Name
S2019 John
S2019 Stephane
S2019 Mike
S2019 Hamid
S2021 Rahim
S2021 Ahamed
I want to group the dataset by "Type", add a new column named "Sampled", and randomly assign yes/no to each row, with yes and no distributed equally within each group. The expected dataframe could be:
Type Name Sampled
S2019 John no
S2019 Stephane yes
S2019 Mike yes
S2019 Hamid no
S2021 Rahim yes
S2021 Ahamed no
Upvotes: 1
Views: 124
Reputation: 260580
You can use numpy.random.choice:
import numpy as np
df['Sampled'] = np.random.choice(['yes', 'no'], size=len(df))
output:
Type Name Sampled
0 S2019 John no
1 S2019 Stephane no
2 S2019 Mike yes
3 S2019 Hamid no
4 S2021 Rahim no
5 S2021 Ahamed yes
# same idea, but drawing the values separately for each group
df['Sampled'] = (df.groupby('Type')['Type']
                   .transform(lambda g: np.random.choice(['yes', 'no'],
                                                         size=len(g)))
                )
For each group, take an arbitrary column (here Type, but it doesn't matter, this is just to get a one-dimensional Series) and apply np.random.choice with the length of the group as the size parameter. This gives as many yes or no values as there are items in the group, each drawn with equal probability (note that you can define a specific probability per item if you want).
NB. Equal probability does not necessarily mean you will get an exact 50/50 split of yes/no; if that is what you want, use the approach below.
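To illustrate the note about per-item probabilities, here is a minimal sketch that passes the `p` argument of `np.random.choice`. The 70/30 weights are an assumption chosen only for the example, and the dataframe is rebuilt from the question's data so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# Sample data mirroring the question.
df = pd.DataFrame({
    "Type": ["S2019"] * 4 + ["S2021"] * 2,
    "Name": ["John", "Stephane", "Mike", "Hamid", "Rahim", "Ahamed"],
})

# Assumed weights for illustration only: 70% "yes", 30% "no".
weights = [0.7, 0.3]

df["Sampled"] = (
    df.groupby("Type")["Type"]
      .transform(lambda g: np.random.choice(["yes", "no"],
                                            size=len(g), p=weights))
)

# Each draw is independent, so the per-group counts vary from run to run
# and are not forced to match the weights exactly.
counts = df.groupby("Type")["Sampled"].value_counts()
print(counts)
```

With groups this small, the observed counts can be far from the weights, which is the same caveat as in the NB above.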
If you want exactly half of each kind (yes/no, ±1 for odd group sizes), you can randomly select half of the indices in each group:
idx = df.groupby('Type', group_keys=False).apply(lambda g: g.sample(n=len(g)//2)).index
df['Sampled'] = np.where(df.index.isin(idx), 'yes', 'no')
NB. For an odd group size, there will be one more of the second value passed to np.where, here "no".
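As a quick check of the half/half behaviour, here is a self-contained sketch of a variant of the same idea: instead of `groupby.apply` with `sample`, it picks half of each group's index labels with a seeded NumPy generator, so the run is reproducible. The seed, the extra odd-sized group S2022 and its names are hypothetical additions for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Type": ["S2019"] * 4 + ["S2021"] * 2 + ["S2022"] * 3,
    # The last three names (and the S2022 group) are hypothetical,
    # added only to show what happens with an odd group size.
    "Name": ["John", "Stephane", "Mike", "Hamid", "Rahim", "Ahamed",
             "Ann", "Bob", "Cid"],
})

rng = np.random.default_rng(42)  # assumed seed, for reproducibility

# For each group, draw floor(n/2) index labels without replacement.
chosen = [
    rng.choice(g.index.to_numpy(), size=len(g) // 2, replace=False)
    for _, g in df.groupby("Type")
]
idx = np.concatenate(chosen)

# Selected rows get "yes", everything else gets "no".
df["Sampled"] = np.where(df.index.isin(idx), "yes", "no")

# One row per Type, one column per label.
counts = pd.crosstab(df["Type"], df["Sampled"])
print(counts)
```

The even-sized groups split exactly in half, while the odd-sized group gets one "yes" and two "no", matching the NB above.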
To generalize to an arbitrary list of items, the code below distributes them equally, up to the limit of multiplicity. This means that for 3 elements and 4 places, there will be two a, one b, and one c in random order. If you want the extra item(s) to be chosen randomly, shuffle the input first.
elem = ['a', 'b', 'c']
df['Sampled'] = (df
    .groupby('Type', group_keys=False)['Type']
    .transform(lambda g: np.random.choice(np.tile(elem, int(np.ceil(len(g)/len(elem))))[:len(g)],
                                          size=len(g), replace=False))
)
output:
Type Name Sampled
0 S2019 John a
1 S2019 Stephane a
2 S2019 Mike b
3 S2019 Hamid c
4 S2021 Rahim a
5 S2021 Ahamed b
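For the "shuffle the input first" suggestion, a possible sketch is shown here. The helper name `balanced_labels` is hypothetical; it shuffles the element list once per group before tiling, so the surplus slots are not always given to the first elements.

```python
import numpy as np
import pandas as pd

# Sample data mirroring the question.
df = pd.DataFrame({
    "Type": ["S2019"] * 4 + ["S2021"] * 2,
    "Name": ["John", "Stephane", "Mike", "Hamid", "Rahim", "Ahamed"],
})

elem = ["a", "b", "c"]

def balanced_labels(g):
    # Hypothetical helper: shuffle the labels so that any surplus slots
    # go to randomly chosen labels rather than always to the first ones.
    shuffled = np.random.permutation(elem)
    reps = int(np.ceil(len(g) / len(shuffled)))
    pool = np.tile(shuffled, reps)[:len(g)]
    # Shuffle the pool so the positions within the group are random too.
    return np.random.permutation(pool)

df["Sampled"] = df.groupby("Type")["Type"].transform(balanced_labels)
print(df)
```

With this version the doubled label in S2019 can be any of a, b, or c, and S2021 gets two distinct labels drawn from the three.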
Upvotes: 1
Reputation: 862611
Use a custom function in GroupBy.transform that creates a helper array arr with equally distributed yes/no values and then randomizes the order with numpy.random.shuffle:
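import numpy as np

def f(x):
    # start with all 'no', then set the first half to 'yes'
    arr = np.full(len(x), 'no', dtype=object)
    arr[:int(len(x) * 0.5)] = 'yes'
    # shuffle in place so the 'yes' positions are random
    np.random.shuffle(arr)
    return arr

df['Sampled'] = df.groupby('Type')['Name'].transform(f)
print(df)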
Type Name Sampled
0 S2019 John yes
1 S2019 Stephane no
2 S2019 Mike no
3 S2019 Hamid yes
4 S2021 Rahim no
5 S2021 Ahamed yes
Upvotes: 1