Reputation: 109
I am trying to randomly assign values from a column in one dataframe to another dataframe, within 12 different categories (by agerange and gender). For example, I have two dataframes; let's call one d1 and the other d2.
d1:
index agerange gender income
0 2 1 56700
1 2 0 25600
2 4 0 3000
3 4 0 106000
4 3 0 200
5 3 0 43000
6 4 0 10000000
d2:
index agerange gender
0 3 0
1 2 0
2 4 0
3 4 0
I want to group both dataframes by agerange and gender (i.e. gender 0 with ageranges 1-6 and gender 1 with ageranges 1-6), then randomly choose one of the incomes from the matching group in d1 and assign it to each row of d2.
For example:
d1:
index agerange gender income
0 2 1 56700
1 2 0 25600
2 4 0 3000
3 4 0 106000
4 3 0 200
5 3 0 43000
6 4 0 10000000
d2:
index agerange gender income
0 3 0 200
1 2 0 25600
2 4 0 10000000
3 4 0 3000
Upvotes: 5
Views: 399
Reputation: 30605
How about creating a dictionary of incomes keyed by agerange and then mapping a random choice onto each row, i.e.
import numpy as np
import pandas as pd
# Based on unutbu's data
df1 = pd.DataFrame({'agerange': [2, 2, 4, 4, 3, 3, 4], 'gender': [1, 0, 0, 0, 0, 0, 0], 'income': [56700, 25600, 3000, 106000, 200, 43000, 10000000], 'index': [0, 1, 2, 3, 4, 5, 6]})
df2 = pd.DataFrame({'agerange': [3, 2, 4, 4], 'gender': [0, 0, 0, 0], 'index': [0, 1, 2, 3]})
age_groups = df1.groupby('agerange')['income'].agg(lambda x: tuple(x)).to_dict()
df2['income'] = df2['agerange'].map(lambda x: np.random.choice(age_groups[x]))
Output :
   agerange  gender  index  income
0         3       0      0   43000
1         2       0      1   25600
2         4       0      2  106000
3         4       0      3  106000
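The draws above change on every run because np.random.choice uses NumPy's global random state. As a minimal sketch, assuming repeatable results are wanted, the same dictionary-and-map idea can be run with a seeded generator. The seed value 0 and the name rng are illustrative choices, not part of the original answer, and the 'index' column is left out for brevity.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'agerange': [2, 2, 4, 4, 3, 3, 4],
    'gender': [1, 0, 0, 0, 0, 0, 0],
    'income': [56700, 25600, 3000, 106000, 200, 43000, 10000000],
})
df2 = pd.DataFrame({'agerange': [3, 2, 4, 4], 'gender': [0, 0, 0, 0]})

# Seeded generator so the draws are repeatable (the seed value is arbitrary).
rng = np.random.default_rng(0)

# Map each agerange to the list of incomes observed for it in df1.
age_groups = df1.groupby('agerange')['income'].agg(list).to_dict()

# Draw one income per row of df2 from the pool for that row's agerange.
df2['income'] = df2['agerange'].map(lambda a: rng.choice(age_groups[a]))
print(df2)
```

Like the snippet it follows, this groups by agerange only, so gender is ignored when building the pools.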
If grouping by gender is also required, you can use apply instead. To fill in 0 for keys that are not found, add an if/else, i.e.
df2 = pd.DataFrame({'agerange': [3, 2, 6, 4], 'gender': [0, 0, 0, 0], 'index': [0, 1, 2, 3]})
df1 = pd.DataFrame({'agerange': [2, 2, 4, 4, 3, 3, 4], 'gender': [1, 0, 0, 0, 0, 0, 0], 'income': [56700, 25600, 3000, 106000, 200, 43000, 10000000], 'index': [0, 1, 2, 3, 4, 5, 6]})
age_groups = df1.groupby(['agerange','gender'])['income'].agg(lambda x: tuple(x)).to_dict()
df2['income'] = df2.apply(lambda x: np.random.choice(age_groups[x['agerange'],x['gender']]) if (x['agerange'],x['gender']) in age_groups else 0,axis=1)
Output :
   agerange  gender  index  income
0         3       0      0   43000
1         2       0      1   25600
2         6       0      2       0
3         4       0      3  106000
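Filling with 0 can be mistaken for a real income of zero. As a hedged variation, assuming a missing marker is preferable, the lookup can use dict.get and return NaN when d1 has no matching (agerange, gender) group. The helper name draw_income, the pools dictionary, and the seeded rng are hypothetical names introduced for this sketch.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'agerange': [2, 2, 4, 4, 3, 3, 4],
    'gender': [1, 0, 0, 0, 0, 0, 0],
    'income': [56700, 25600, 3000, 106000, 200, 43000, 10000000],
})
# agerange 6 has no counterpart in df1, as in the example above.
df2 = pd.DataFrame({'agerange': [3, 2, 6, 4], 'gender': [0, 0, 0, 0]})

rng = np.random.default_rng(0)  # arbitrary seed, only for repeatability

# (agerange, gender) -> list of incomes seen in df1 for that group.
pools = df1.groupby(['agerange', 'gender'])['income'].agg(list).to_dict()

def draw_income(row):
    """Return a random income from the row's group, or NaN if df1 has no such group."""
    pool = pools.get((row['agerange'], row['gender']))
    return rng.choice(pool) if pool is not None else np.nan

df2['income'] = df2.apply(draw_income, axis=1)
print(df2)
```

The income column becomes float because of the NaN; rows that need a value can then be found with df2['income'].isna().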
Upvotes: 3
Reputation: 294258
Option 1
An approach with np.random.choice and pd.DataFrame.query.
I'm implicitly assuming that incomes are drawn with replacement, i.e. each row of d2 gets an independent draw, so the same income can be assigned more than once.
def take_one(x):
    q = 'agerange == {agerange} and gender == {gender}'.format(**x)
    return np.random.choice(d1.query(q).income)
d2.assign(income=d2.apply(take_one, 1))
agerange gender income
index
0 3 0 200
1 2 0 25600
2 4 0 106000
3 4 0 106000
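Each row above gets an independent draw, so an income can repeat within a group (as 106000 does in this output). If repeats are not wanted, one possible sketch is to draw without replacement per group whenever d1 has at least as many incomes as d2 has rows in that group, and to fall back to drawing with replacement otherwise. This is a different technique from the one in Option 1; the names rng, pools and result, and the fallback rule, are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

d1 = pd.DataFrame({
    'agerange': [2, 2, 4, 4, 3, 3, 4],
    'gender': [1, 0, 0, 0, 0, 0, 0],
    'income': [56700, 25600, 3000, 106000, 200, 43000, 10000000],
})
d2 = pd.DataFrame({'agerange': [3, 2, 4, 4], 'gender': [0, 0, 0, 0]})

rng = np.random.default_rng(0)  # arbitrary seed, only for repeatability
keys = ['agerange', 'gender']
pools = d1.groupby(keys)['income'].agg(list)

income = pd.Series(np.nan, index=d2.index)
for key, rows in d2.groupby(keys).groups.items():
    pool = pools.get(key)
    if pool is None:
        continue  # no donor incomes in d1 for this group; leave NaN
    # Draw without replacement only when the pool is large enough.
    replace = len(rows) > len(pool)
    income.loc[rows] = rng.choice(pool, size=len(rows), replace=replace)

result = d2.assign(income=income)
print(result)
```

With this data the two rows in the (4, 0) group always receive two different incomes, because d1 holds three incomes for that group.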
Option 2
An attempt to make it more efficient by calling np.random.choice once per group.
g = d1.groupby(['agerange', 'gender']).income.apply(list)
f = lambda x: pd.Series(np.random.choice(g.get(x.name, [0] * len(x)), len(x)), x.index)
d2.assign(income=d2.groupby(['agerange', 'gender'], group_keys=False).apply(f))
agerange gender income
index
0 3 0 200
1 2 0 25600
2 4 0 10000000
3 4 0 106000
Debugging and Setup
import pandas as pd
import numpy as np
d1 = pd.DataFrame({
    'agerange': [2, 2, 4, 4, 3, 3, 4],
    'gender': [1, 0, 0, 0, 0, 0, 0],
    'income': [56700, 25600, 3000, 106000, 200, 43000, 10000000]
}, pd.Index([0, 1, 2, 3, 4, 5, 6], name='index')
)
d2 = pd.DataFrame(
    {'agerange': [3, 2, 4, 4], 'gender': [0, 0, 0, 0]},
    pd.Index([0, 1, 2, 3], name='index')
)
g = d1.groupby(['agerange', 'gender']).income.apply(list)
f = lambda x: pd.Series(np.random.choice(g.loc[x.name], len(x)), x.index)
d2.assign(income=d2.groupby(['agerange', 'gender'], group_keys=False).apply(f))
agerange gender income
index
0 3 0 200
1 2 0 25600
2 4 0 106000
3 4 0 3000
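Because the output is random, it is hard to confirm by eye that it is right. One way to debug, sketched here under the assumption that a result is valid when every assigned income occurs in d1 for the same (agerange, gender) pair, is a left merge with indicator=True. The function name incomes_match_groups and the good/bad frames are hypothetical and only illustrate the check.

```python
import pandas as pd

d1 = pd.DataFrame({
    'agerange': [2, 2, 4, 4, 3, 3, 4],
    'gender': [1, 0, 0, 0, 0, 0, 0],
    'income': [56700, 25600, 3000, 106000, 200, 43000, 10000000],
})
d2 = pd.DataFrame({'agerange': [3, 2, 4, 4], 'gender': [0, 0, 0, 0]})

def incomes_match_groups(donors, assigned, keys=('agerange', 'gender')):
    """True if every (keys..., income) row of `assigned` also occurs in `donors`."""
    cols = list(keys) + ['income']
    merged = assigned[cols].merge(
        donors[cols].drop_duplicates(), on=cols, how='left', indicator=True
    )
    # '_merge' is 'both' when the row was found in donors, 'left_only' otherwise.
    return bool((merged['_merge'] == 'both').all())

# The incomes printed in the output above: all come from the right groups.
good = d2.assign(income=[200, 25600, 106000, 3000])
# 56700 belongs to (agerange 2, gender 1), not to (agerange 2, gender 0).
bad = d2.assign(income=[200, 56700, 3000, 106000])

print(incomes_match_groups(d1, good))
print(incomes_match_groups(d1, bad))
```

The check does not depend on how the incomes were drawn, so it can be applied to the result of any of the approaches in this thread.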
Upvotes: 4
Reputation: 153460
d2['income'] = d2.apply(
    lambda x: d1.loc[(d1.agerange == x.agerange) & (d1.gender == x.gender), 'income'].sample(n=1).max(),
    axis=1
)
Output:
index agerange gender income
0 0 3 0 200
1 1 2 0 25600
2 2 4 0 3000
3 3 4 0 106000
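The one-liner assumes every row of d2 has at least one matching row in d1; sampling from an empty selection would fail. A hedged sketch of the same mask-and-sample idea with a guard for empty groups follows. The function name sample_income, the RandomState seed, the NaN fallback, and the agerange 6 test row are assumptions added for illustration.

```python
import numpy as np
import pandas as pd

d1 = pd.DataFrame({
    'agerange': [2, 2, 4, 4, 3, 3, 4],
    'gender': [1, 0, 0, 0, 0, 0, 0],
    'income': [56700, 25600, 3000, 106000, 200, 43000, 10000000],
})
# agerange 6 is a hypothetical row with no match in d1.
d2 = pd.DataFrame({'agerange': [3, 2, 6, 4], 'gender': [0, 0, 0, 0]})

# One shared RandomState so successive rows get different draws (seed is arbitrary).
rs = np.random.RandomState(0)

def sample_income(row):
    """Sample one income from d1 rows matching this row's agerange and gender."""
    mask = (d1['agerange'] == row['agerange']) & (d1['gender'] == row['gender'])
    pool = d1.loc[mask, 'income']
    if pool.empty:
        return np.nan  # no donor rows for this group
    return pool.sample(n=1, random_state=rs).iloc[0]

d2['income'] = d2.apply(sample_income, axis=1)
print(d2)
```

Using .iloc[0] instead of .max() returns the same value here, since the sample holds a single element; it just makes the intent of taking that one element explicit.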
Upvotes: 3