stav

Reputation: 109

How to assign values randomly between dataframes

I am trying to randomly assign values from a column in one dataframe to another dataframe, within 12 different categories (by agerange and gender). For example, I have two dataframes; let's call one d1 and the other d2:

  d1:
index agerange gender income
 0     2        1      56700
 1     2        0      25600
 2     4        0      3000
 3     4        0      106000
 4     3        0      200
 5     3        0      43000
 6     4        0      10000000

d2:
index agerange gender 
 0     3        0      
 1     2        0      
 2     4        0      
 3     4        0      

I want to group both dataframes by agerange and gender (i.e. gender 0 with ageranges 1-6 and gender 1 with ageranges 1-6), then randomly choose one of the incomes from the matching group in d1 and assign it to each row of d2.

For example:

d1:
index agerange gender income
 0     2        1      56700
 1     2        0      25600
 2     4        0      3000
 3     4        0      106000
 4     3        0      200
 5     3        0      43000
 6     4        0      10000000

d2:
index agerange gender  income
 0     3        0      200  
 1     2        0      25600 
 2     4        0      10000000
 3     4        0      3000

Upvotes: 5

Views: 399

Answers (3)

Bharath M Shetty

Reputation: 30605

How about creating a dictionary of incomes keyed by agerange and then mapping a random choice from it onto df2, i.e.:

#Based on unutbu's data
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'agerange': [2, 2, 4, 4, 3, 3, 4], 'gender': [1, 0, 0, 0, 0, 0, 0], 'income': [56700, 25600, 3000, 106000, 200, 43000, 10000000], 'index': [0, 1, 2, 3, 4, 5, 6]})
df2 = pd.DataFrame({'agerange': [3, 2, 4, 4], 'gender': [0, 0, 0, 0], 'index': [0, 1, 2, 3]})

# Collect the incomes seen for each agerange, then draw one at random for every row of df2
age_groups = df1.groupby('agerange')['income'].agg(lambda x: tuple(x)).to_dict()
df2['income'] = df2['agerange'].map(lambda x: np.random.choice(age_groups[x]))

Output :

  agerange  gender  index  income
0         3       0      0   43000
1         2       0      1   25600
2         4       0      2  106000
3         4       0      3  106000

If the gender group is also required, you can use apply with a dictionary keyed by (agerange, gender). To fill in 0 for keys that are not found, add an if/else, i.e.:

df2 = pd.DataFrame({'agerange': [3, 2, 6, 4], 'gender': [0, 0, 0, 0], 'index': [0, 1, 2, 3]})
df1 = pd.DataFrame({'agerange': [2, 2, 4, 4, 3, 3, 4], 'gender': [1, 0, 0, 0, 0, 0, 0], 'income': [56700, 25600, 3000, 106000, 200, 43000, 10000000], 'index': [0, 1, 2, 3, 4, 5, 6]})


age_groups = df1.groupby(['agerange','gender'])['income'].agg(lambda x: tuple(x)).to_dict()
df2['income'] = df2.apply(lambda x: np.random.choice(age_groups[x['agerange'],x['gender']]) if (x['agerange'],x['gender']) in age_groups else 0,axis=1)

Output :

   agerange  gender  index  income
0         3       0      0   43000
1         2       0      1   25600
2         6       0      2       0
3         4       0      3  106000
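
Because the draws above use the global NumPy random state, the output changes on every run. If reproducible assignments are needed, one option is a seeded generator. The following is only a sketch of that idea: the helper name `assign_income`, the inner `draw` function, and the seed value are assumptions made for illustration and are not part of the answer above.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'agerange': [2, 2, 4, 4, 3, 3, 4],
                    'gender': [1, 0, 0, 0, 0, 0, 0],
                    'income': [56700, 25600, 3000, 106000, 200, 43000, 10000000]})
df2 = pd.DataFrame({'agerange': [3, 2, 6, 4], 'gender': [0, 0, 0, 0]})

# Incomes observed in df1 for each (agerange, gender) pair
age_groups = df1.groupby(['agerange', 'gender'])['income'].apply(list).to_dict()

def assign_income(frame, seed):
    """Hypothetical helper: draw one income per row using a seeded generator."""
    rng = np.random.default_rng(seed)

    def draw(row):
        pool = age_groups.get((row['agerange'], row['gender']))
        # Fall back to 0 when df1 has no matching group, as in the answer above
        return rng.choice(pool) if pool else 0

    return frame.apply(draw, axis=1)

first = assign_income(df2, seed=0)   # seed value is arbitrary
second = assign_income(df2, seed=0)  # same seed, same draws
df2['income'] = first
```

With the same seed, `first` and `second` hold identical values; changing the seed gives a different but still repeatable assignment.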

Upvotes: 3

piRSquared

Reputation: 294258

Option 1
An approach with np.random.choice and pd.DataFrame.query
I'm assuming that we sample with replacement, i.e. each row of d2 gets its own independent draw, so the same income may be assigned to more than one row.

def take_one(x):
    q = 'agerange == {agerange} and gender == {gender}'.format(**x)
    return np.random.choice(d1.query(q).income)

d2.assign(income=d2.apply(take_one, 1))

       agerange  gender  income
index                          
0             3       0     200
1             2       0   25600
2             4       0  106000
3             4       0  106000

Option 2
An attempt to make this more efficient by calling np.random.choice once per group instead of once per row. Groups missing from d1 are filled with 0.

g = d1.groupby(['agerange', 'gender']).income.apply(list)
f = lambda x: pd.Series(np.random.choice(g.get(x.name, [0] * len(x)), len(x)), x.index)
d2.assign(income=d2.groupby(['agerange', 'gender'], group_keys=False).apply(f))

       agerange  gender    income
index                            
0             3       0       200
1             2       0     25600
2             4       0  10000000
3             4       0    106000

Debugging and Setup

import pandas as pd
import numpy as np

d1 = pd.DataFrame({
        'agerange': [2, 2, 4, 4, 3, 3, 4],
        'gender': [1, 0, 0, 0, 0, 0, 0],
        'income': [56700, 25600, 3000, 106000, 200, 43000, 10000000]
    }, pd.Index([0, 1, 2, 3, 4, 5, 6], name='index')
)

d2 = pd.DataFrame(
    {'agerange': [3, 2, 4, 4], 'gender': [0, 0, 0, 0]},
    pd.Index([0, 1, 2, 3], name='index')
)

g = d1.groupby(['agerange', 'gender']).income.apply(list)
f = lambda x: pd.Series(np.random.choice(g.loc[x.name], len(x)), x.index)
d2.assign(income=d2.groupby(['agerange', 'gender'], group_keys=False).apply(f))

       agerange  gender  income
index                          
0             3       0     200
1             2       0   25600
2             4       0  106000
3             4       0    3000
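
The once-per-group idea can also be written as an explicit loop over the groups of d2, which avoids depending on how groupby.apply stitches the per-group results back together. This is a rough sketch rather than a drop-in replacement: the names `rng`, `pools`, and `result` and the seed are my own assumptions, and groups missing from d1 keep an income of 0 as in Option 2.

```python
import numpy as np
import pandas as pd

d1 = pd.DataFrame({'agerange': [2, 2, 4, 4, 3, 3, 4],
                   'gender': [1, 0, 0, 0, 0, 0, 0],
                   'income': [56700, 25600, 3000, 106000, 200, 43000, 10000000]},
                  pd.Index([0, 1, 2, 3, 4, 5, 6], name='index'))
d2 = pd.DataFrame({'agerange': [3, 2, 4, 4], 'gender': [0, 0, 0, 0]},
                  pd.Index([0, 1, 2, 3], name='index'))

rng = np.random.default_rng(42)  # arbitrary seed, only for repeatability

# One list of candidate incomes per (agerange, gender) group in d1
pools = d1.groupby(['agerange', 'gender'])['income'].apply(list)

result = d2.copy()
result['income'] = 0  # default for groups that d1 does not contain

# .groups maps each (agerange, gender) key to the row labels in that group
for key, labels in d2.groupby(['agerange', 'gender']).groups.items():
    if key in pools.index:
        # A single call draws one income (with replacement) for every row in the group
        result.loc[labels, 'income'] = rng.choice(pools.loc[key], size=len(labels))
```

Every income in `result` then comes from the d1 group with the same agerange and gender, and the row order of d2 is left untouched.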

Upvotes: 4

Scott Boston

Reputation: 153460

d2['income'] = d2.apply(lambda x: d1.loc[(d1.agerange==x.agerange) &(d1.gender == x.gender),'income'].sample(n=1).max(),axis=1)

Output:

   index  agerange  gender  income
0      0         3       0     200
1      1         2       0   25600
2      2         4       0    3000
3      3         4       0  106000
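
One caveat with the one-liner: as far as I know, `sample(n=1)` raises an error when the boolean filter matches no rows, which happens if d2 contains an agerange/gender pair that never occurs in d1. A guarded variant might look like the sketch below. The function name `pick_income` is hypothetical, returning NaN for unmatched rows is an assumption about the desired behaviour, and `.iloc[0]` is swapped in for `.max()` to pull out the single sampled value.

```python
import pandas as pd

d1 = pd.DataFrame({'agerange': [2, 2, 4, 4, 3, 3, 4],
                   'gender': [1, 0, 0, 0, 0, 0, 0],
                   'income': [56700, 25600, 3000, 106000, 200, 43000, 10000000]})
# agerange 6 does not occur in d1, so row 2 has no matching group
d2 = pd.DataFrame({'agerange': [3, 2, 6, 4], 'gender': [0, 0, 0, 0]})

def pick_income(row):
    """Hypothetical helper: sample one matching income, or NaN if none exist."""
    pool = d1.loc[(d1.agerange == row.agerange) & (d1.gender == row.gender), 'income']
    if pool.empty:
        return float('nan')
    return pool.sample(n=1).iloc[0]

d2['income'] = d2.apply(pick_income, axis=1)
```

Because of the NaN, the income column becomes float; use a sentinel such as 0 instead if an integer column matters.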

Upvotes: 3
