Mysterious

Reputation: 881

Assign random values in a pandas column proportionately by group

I have a dataframe df like this:

ID   Category   Result
1     A          ...
2     B          ...
3     B          ...
4     C          ...

Wherever the category is A, I want to assign the three Result values (Pass, Fail, Hold) in random order, in the proportions 30%, 40%, 30%. The other categories should be handled the same way, each with its own proportions. Is there a fast way to do this?

Currently I am using

np.split(df[cond], [int(.3*len(df[cond])), int(.7*len(df[cond]))])

to split the data into proportions followed by

df1['Result'] = 'Pass'
df2['Result'] = 'Fail'...
pd.concat([df1, df2, ...])  # all the per-condition frames

to get the full set.

Upvotes: 0

Views: 2003

Answers (1)

jpp

Reputation: 164663

Here's an idea. You can use GroupBy with np.random.choice.

This doesn't guarantee that your proportions are kept exactly. For example, if a category has only one row, the proportions cannot be kept when more than one weight is non-zero. Even when they could be kept, each value is still drawn at random, so the observed shares will vary from run to run. What you can say is that as the number of rows in a group tends towards infinity, the observed ratios tend towards the provided weights.
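To make that caveat concrete, here is a small sketch that draws labels for a single hypothetical category at two sample sizes and reports the observed share of each value. The helper name observed_shares, the seed, and the sample sizes are illustrative assumptions, not part of the question.

```python
import numpy as np

values = ['Pass', 'Fail', 'Hold']
weights = [0.3, 0.4, 0.3]  # target proportions for one hypothetical category

rng = np.random.default_rng(0)  # seeded only so the run is repeatable

def observed_shares(n):
    """Draw n labels independently and return the observed share of each value."""
    draws = rng.choice(values, size=n, p=weights)
    return {v: float(np.mean(draws == v)) for v in values}

small = observed_shares(10)       # few rows: shares can be far from the weights
large = observed_shares(100_000)  # many rows: shares sit close to the weights

print(small)
print(large)
```

With only a handful of rows the shares can land well away from 0.3/0.4/0.3, while the large sample should sit within a fraction of a percentage point of them.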

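import numpy as np
import pandas as pd

values = ['Pass', 'Fail', 'Hold']
weights = {'A': [0.3, 0.4, 0.3], 'B': [0.6, 0.2, 0.2]}

df = pd.DataFrame({'Category': list('A'*10 + 'B'*5)})

np.random.seed(0)

def apply_randoms(x):
    # x.name is the group key ('A' or 'B'); draw one weighted value per row
    key = x.name
    return pd.Series(np.random.choice(values, size=len(x), p=weights[key]))

# keyword form of drop; the positional axis argument (.drop('level_1', 1))
# is no longer accepted by recent pandas versions
df = df.groupby('Category').apply(apply_randoms)\
       .rename('Result').reset_index().drop(columns='level_1')

print(df)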

   Category Result
0         A   Hold
1         A   Fail
2         A   Fail
3         A   Hold
4         A   Pass
5         A   Pass
6         A   Pass
7         A   Hold
8         A   Hold
9         A   Hold
10        B   Hold
11        B   Fail
12        B   Pass
13        B   Fail
14        B   Pass
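If the proportions need to hold exactly (up to rounding) rather than only on average, one alternative is to build the right number of each label per group and shuffle them. This is a different technique from the np.random.choice approach above, sketched under assumptions: the helper exact_labels is hypothetical, and leftover rows from rounding are handed out by largest fractional remainder, which is only one reasonable choice.

```python
import numpy as np
import pandas as pd

values = ['Pass', 'Fail', 'Hold']
weights = {'A': [0.3, 0.4, 0.3], 'B': [0.6, 0.2, 0.2]}

def exact_labels(n, p, rng):
    """Return n shuffled labels whose counts match p as closely as rounding allows."""
    target = np.asarray(p) * n
    counts = np.floor(target).astype(int)
    # Hand out the rows lost to flooring, largest fractional remainder first.
    remainder = target - counts
    for i in np.argsort(-remainder)[: n - counts.sum()]:
        counts[i] += 1
    labels = np.repeat(values, counts)
    rng.shuffle(labels)  # random order within the group
    return labels

df = pd.DataFrame({'Category': list('A' * 10 + 'B' * 5)})
rng = np.random.default_rng(0)

df['Result'] = None
for key, idx in df.groupby('Category').groups.items():
    # idx holds the row labels belonging to this category
    df.loc[idx, 'Result'] = exact_labels(len(idx), weights[key], rng)

print(df.groupby('Category')['Result'].value_counts())
```

For the 10 rows of category A this gives exactly 3 Pass, 4 Fail and 3 Hold, and for the 5 rows of category B exactly 3 Pass, 1 Fail and 1 Hold; only the order within each group is random.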

Upvotes: 3
