Reputation: 6132
I'm working with a dataframe like this:
group period
A 20130101
A 20130201
. .
E 20130901
E 20131001
Let's say I have 100 different groups and 10 possible dates, which are distributed like this: [.1,.05,.2,.05,.1,.1,.2,.05,.05,.1]. I need to get one sample for each group, so that 10% of the final sample comes from the first period, 5% from the second period, 20% from the third period, and so on. I managed to get a random sample for each group, but it's heavily skewed, like this:
fn = lambda obj: obj.loc[np.random.choice(obj.index, 1, replace=False),:]
dfrd = df[['group','period']].groupby('group', as_index=False).apply(fn)
dfrd.index = [index[1] for index in dfrd.index]
So, is there any way to do something similar, but stratified? Thanks
Upvotes: 0
Views: 198
Reputation: 21709
You can use the p parameter of np.random.choice:
# prob needs one entry per row of each group, in the same order as the periods
df1 = (df
       .groupby('grp')
       .apply(lambda x: np.random.choice(x['period'].values, size=1, p=prob)[0])
       .reset_index()
       .rename(columns={0:'period'}))
  grp      period
0   A  2013-01-03
1   B  2013-01-04
2   C  2013-01-04
3   D  2013-01-03
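Note that each group is drawn independently here, so the period shares only match prob on average, not exactly. If you need the exact split from the question (10 of 100 groups from the first period, 5 from the second, and so on), one option is to fix the counts up front and shuffle them across groups. This is a different technique from the weighted draw above, and only a rough sketch: the group labels are made up, and it assumes every group actually contains every period.

```python
import numpy as np
import pandas as pd

periods = list(map(str, pd.date_range(start='20130101', freq='D', periods=10).date))
prob = np.array([.1, .05, .2, .05, .1, .1, .2, .05, .05, .1])
groups = [f'G{i:03d}' for i in range(100)]  # hypothetical labels for 100 groups

# Turn the target shares into exact counts per period (10, 5, 20, ...).
counts = np.rint(prob * len(groups)).astype(int)
assert counts.sum() == len(groups)  # rounding must still add up to one draw per group

# Repeat each period by its count, then shuffle so groups get periods at random.
rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable
assigned = rng.permutation(np.repeat(periods, counts))

quota = pd.DataFrame({'grp': groups, 'period': assigned})
```

quota['period'].value_counts() should then give exactly the counts implied by prob. If the shares do not multiply out to whole numbers for your group count, the rounding step would need adjusting.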
Sample Data
import numpy as np
import pandas as pd

period = list(map(str, pd.date_range(start='20130101', freq='D', periods=10).date))
grp = sorted(['A','B','C','D']*10)
prob = [.1,.05,.2,.05,.1,.1,.2,.05,.05,.1]
df = pd.DataFrame({'grp': grp, 'period': period*4})
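One caveat: np.random.choice requires p to have the same length as the array being sampled, so passing prob directly only works when every group has all 10 periods in the same order. If that might not hold in your real data, a possible variant is to look the weights up by period value and renormalise them per group. This is just a sketch; weights and draw_period are names I made up for illustration.

```python
import numpy as np
import pandas as pd

# Same sample data as above: 4 groups, each with the same 10 periods.
periods = list(map(str, pd.date_range(start='20130101', freq='D', periods=10).date))
prob = [.1, .05, .2, .05, .1, .1, .2, .05, .05, .1]
df = pd.DataFrame({'grp': sorted(['A', 'B', 'C', 'D'] * 10), 'period': periods * 4})

# Map each period value to its target weight, so weights follow the period
# itself rather than the row position inside the group.
weights = pd.Series(prob, index=periods)

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable

def draw_period(group_periods):
    """Draw one period from a group, weighted by the periods it actually has."""
    values = group_periods.to_numpy()
    w = weights.loc[values].to_numpy()
    # Renormalise so the weights sum to 1 even if some periods are missing.
    return rng.choice(values, p=w / w.sum())

sampled = (df.groupby('grp')['period']
             .apply(draw_period)
             .reset_index())
```

sampled has one row per group with columns grp and period. When a group is missing periods, the remaining weights are scaled up proportionally, which shifts the overall shares slightly away from prob.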
Upvotes: 2