Ahmad

Reputation: 9658

How can I sample from a DataFrame, weighted by a groupby column?

Here is the sampling method I tried:

sample = 2000
sample_df = df.groupby('prefix').sample(n=sample, random_state=1)

It groups df by prefix and, for each group, samples 2,000 rows. I have 9 groups. I want to sample 18,000 rows in total, with each group's share weighted by the number of rows in that group.

Upvotes: 2

Views: 2133

Answers (1)

Mustafa Aydın

Reputation: 18306

IIUC, here is one way:

sample = 2000
col_name = "prefix"

probs = df[col_name].map(df[col_name].value_counts())
sample_df = df.sample(n=sample, weights=probs)

probs holds one (unnormalized) weight per row: the number of times that row's prefix value occurs in the column. df.sample normalizes the weights internally, so they do not need to sum to 1.
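To make the "unnormalized" point concrete, here is a minimal sketch on a made-up frame (the prefix values and the value column are hypothetical, chosen only for illustration). It passes the raw counts and a hand-normalized copy as weights; since pandas rescales the weights to sum to 1 itself, the two calls should typically pick the same rows for the same seed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: group "a" has 6 rows, "b" has 3, "c" has 1
df = pd.DataFrame({
    "prefix": ["a"] * 6 + ["b"] * 3 + ["c"] * 1,
    "value": np.arange(10),
})

counts = df["prefix"].value_counts()   # rows per prefix value
probs = df["prefix"].map(counts)       # one weight per row (6, 3 or 1)

# Hand-normalized copy of the same weights; sums to 1
normalized = probs / probs.sum()

# Raw counts are accepted as weights; pandas rescales them internally
s1 = df.sample(n=4, weights=probs, random_state=0)
s2 = df.sample(n=4, weights=normalized, random_state=0)

print(s1)
print(s1.index.equals(s2.index))
```

Passing the raw counts is therefore just a convenience; nothing about the sampling depends on the scale of the weights.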


Steps on some sample data:

>>> df

       B         C         D
0   this  0.469112 -0.861849
1   this -0.282863 -2.104569
2  other -1.509059 -0.494929
3   view -1.135632  1.071804
4  other  1.212112  0.721555
5  other -0.173215 -0.706771
6   this  0.119209 -1.039575
7   view -1.044236  0.271860
8  other  0.322124  2.010234

>>> col_name = "B"
>>> sample = 4

>>> counts = df[col_name].value_counts()
>>> counts

other    4
this     3
view     2
Name: B, dtype: int64

>>> probs = df[col_name].map(counts)
>>> probs

0    3
1    3
2    4
3    2
4    4
5    4
6    3
7    2
8    4
Name: B, dtype: int64

# viewing the weights side by side with df.B
>>> pd.concat([df.B, probs], axis=1)

       B  B
0   this  3
1   this  3
2  other  4
3   view  2
4  other  4
5  other  4
6   this  3
7   view  2
8  other  4

That is, each row is assigned a weight equal to the count of its col_name value, so rows with more frequent values are relatively more likely to be drawn.
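One consequence worth checking before relying on these weights: every row in a group carries the group's count as its weight, so the group's total weight is its count squared. The sketch below rebuilds only the B column from the sample data and compares the per-group chance on a single draw with each group's plain share of rows. The names group_weight, first_draw_share and proportional_share are made up for this illustration.

```python
import pandas as pd

# Column B from the sample data above
df = pd.DataFrame({"B": ["this", "this", "other", "view", "other",
                         "other", "this", "view", "other"]})

counts = df["B"].value_counts()    # other 4, this 3, view 2
probs = df["B"].map(counts)        # per-row weights

# Total weight per group = count * count
group_weight = probs.groupby(df["B"]).sum()

# Chance that a single weighted draw lands in each group
first_draw_share = group_weight / group_weight.sum()

# Each group's plain share of the rows, for comparison
proportional_share = counts / counts.sum()

print(group_weight)
print(first_draw_share)
print(proportional_share)
```

Here "other" gets 16/29 of the weight on a single draw, although it holds 4/9 of the rows, so count-based row weights favor large groups more than their share of the data. If the goal is for each group to appear in proportion to its size, an unweighted df.sample(n=sample) already achieves that in expectation.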

# sampling:
>>> sample_df = df.sample(n=sample, weights=probs, random_state=1284)
>>> sample_df

       B         C         D
6   this  0.119209 -1.039575
3   view -1.135632  1.071804
2  other -1.509059 -0.494929
5  other -0.173215 -0.706771
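If the intent in the question is for each group to contribute rows in proportion to its size (18,000 rows in total there), another possible reading is a stratified sample: take the same fraction from every group with groupby(...).sample(frac=...), available in pandas 1.1 and later. This is a different technique from the weighted df.sample above, sketched on hypothetical data: the group sizes, the value column and the target of 100 rows are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical data: three groups with 500, 300 and 200 rows
df = pd.DataFrame({
    "prefix": np.repeat(["a", "b", "c"], [500, 300, 200]),
    "value": rng.normal(size=1000),
})

total = 100                  # desired overall sample size (assumed)
frac = total / len(df)       # same fraction taken from every group

# Each group contributes round(frac * group_size) rows
sample_df = df.groupby("prefix").sample(frac=frac, random_state=1)

sizes = sample_df["prefix"].value_counts()
print(sizes)
```

Because each group's size is rounded separately, the overall total can differ from the target by a few rows when the fraction does not divide the group sizes evenly.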

Upvotes: 1
