Reputation: 881
I have a DataFrame df like this:
ID Category Result
1 A ...
2 B ...
3 B ...
4 C ...
For every row where Category is A, I want to assign one of three Result values (Pass, Fail, Hold) in random order, in the proportions 30%, 40%, 30%. The other categories work the same way, but with different proportions. Is there a fast way to do this?
Currently I am using
df1, df2, df3 = np.split(df[cond], [int(.3*len(df[cond])), int(.7*len(df[cond]))])
to split the data into proportions followed by
df1['Result'] = 'Pass'
df2['Result'] = 'Fail'...
pd.concat([df1,df2,...all conditioned columns frames])
to get the full set.
Upvotes: 0
Views: 2003
Reputation: 164663
Here's an idea. You can use GroupBy with np.random.choice.
This doesn't guarantee that your proportions are kept exactly. For example, if a category has only one row, the proportions cannot be kept when all of its weights are non-zero. Even when they could be kept, each value is still drawn at random, so the observed split will usually differ from the weights. What you can say with this method is that, as the number of rows tends to infinity, the observed ratios tend toward the provided weights.
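To make that caveat concrete, here is a small sketch that compares the observed shares for a tiny sample and a large one. The helper name `observed_shares`, the seed, and the sample sizes are assumptions made for illustration; they are not part of the original answer.

```python
import numpy as np

values = ['Pass', 'Fail', 'Hold']
p = [0.3, 0.4, 0.3]  # weights for category A, as given in the question

rng = np.random.default_rng(0)  # seed chosen arbitrarily for this sketch

def observed_shares(n):
    """Draw n results and return the share of each value, in the order of `values`."""
    draws = rng.choice(values, size=n, p=p)
    return np.array([(draws == v).mean() for v in values])

small = observed_shares(10)         # may be far from the weights
large = observed_shares(1_000_000)  # should be close to the weights
print(small)
print(large)
```

With only ten rows the shares can land well away from 0.3/0.4/0.3, while the large sample should sit very close to them.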
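import numpy as np
import pandas as pd

values = ['Pass', 'Fail', 'Hold']
# per-category probabilities, in the same order as `values`
weights = {'A': [0.3, 0.4, 0.3], 'B': [0.6, 0.2, 0.2]}
df = pd.DataFrame({'Category': list('A'*10 + 'B'*5)})
np.random.seed(0)
def apply_randoms(x):
    # x.name holds the group key, i.e. the category of this group
    key = x.name
    return pd.Series(np.random.choice(values, size=len(x), p=weights[key]))
df = df.groupby('Category').apply(apply_randoms)\
       .rename('Result').reset_index().drop(columns='level_1')
# the exact draws can differ between pandas versions, even with the same seed
print(df)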
Category Result
0 A Hold
1 A Fail
2 A Fail
3 A Hold
4 A Pass
5 A Pass
6 A Pass
7 A Hold
8 A Hold
9 A Hold
10 B Hold
11 B Fail
12 B Pass
13 B Fail
14 B Pass
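If the counts per category must match the weights as closely as the group size allows, one possible alternative is to build the labels with fixed counts and then shuffle them. This is a sketch under stated assumptions, not part of the original answer: the helper `exact_labels` is a hypothetical name, and the counts are rounded, so the split is only exact when each weight times the group size is a whole number.

```python
import numpy as np
import pandas as pd

values = np.array(['Pass', 'Fail', 'Hold'])
weights = {'A': [0.3, 0.4, 0.3], 'B': [0.6, 0.2, 0.2]}
df = pd.DataFrame({'Category': list('A'*10 + 'B'*5)})
rng = np.random.default_rng(0)  # seed chosen arbitrarily for this sketch

def exact_labels(n, p):
    """Return n shuffled labels whose counts follow p as closely as rounding allows."""
    # Round the cumulative boundaries so the counts always sum to n.
    bounds = np.rint(np.cumsum(p) * n).astype(int)
    bounds[-1] = n  # guard against floating-point drift in the last boundary
    counts = np.diff(bounds, prepend=0)
    labels = np.repeat(values, counts)
    return rng.permutation(labels)

df['Result'] = ''
# .groups maps each category to the index labels of its rows
for key, idx in df.groupby('Category').groups.items():
    df.loc[idx, 'Result'] = exact_labels(len(idx), weights[key])

print(df.groupby('Category')['Result'].value_counts())
```

For the sample frame this gives 3 Pass, 4 Fail and 3 Hold in category A, and 3 Pass, 1 Fail and 1 Hold in category B; only the order within each category is random.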
Upvotes: 3