Reputation: 9658
Here is the sampling method I tried:
sample = 2000
sample_df = df.groupby('prefix').sample(n=sample, random_state=1)
It groups df by prefix and, for each group, samples 2,000 items. I have 9 groups. I want to sample 18,000 in total, but weighted by the number of rows in each group.
Upvotes: 2
Views: 2133
Reputation: 18306
IIUC, here is one way:
sample = 2000
col_name = "prefix"
# each row's weight is the count of rows sharing its value in col_name
probs = df[col_name].map(df[col_name].value_counts())
# rows from larger groups are proportionally more likely to be drawn
sample_df = df.sample(n=sample, weights=probs)
probs holds the corresponding (unnormalized) weights for each value in the prefix column, and we sample according to them.
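As a side note, df.sample normalizes the weights for you; per the pandas docs, weights that do not sum to 1 are normalized to sum to 1. So the call above is equivalent to this sketch with explicit probabilities:
# normalizing by hand gives each row's selection probability;
# df.sample does this internally when the weights don't sum to 1
normalized = probs / probs.sum()
sample_df = df.sample(n=sample, weights=normalized)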
Steps on some sample data:
>>> df
B C D
0 this 0.469112 -0.861849
1 this -0.282863 -2.104569
2 other -1.509059 -0.494929
3 view -1.135632 1.071804
4 other 1.212112 0.721555
5 other -0.173215 -0.706771
6 this 0.119209 -1.039575
7 view -1.044236 0.271860
8 other 0.322124 2.010234
>>> col_name = "B"
>>> sample = 4
>>> counts = df[col_name].value_counts()
>>> counts
other 4
this 3
view 2
Name: B, dtype: int64
>>> probs = df[col_name].map(counts)
>>> probs
0 3
1 3
2 4
3 2
4 4
5 4
6 3
7 2
8 4
Name: B, dtype: int64
# seeing side-by-side with df.B
>>> pd.concat([df.B, probs], axis=1)
B B
0 this 3
1 this 3
2 other 4
3 view 2
4 other 4
5 other 4
6 this 3
7 view 2
8 other 4
i.e., each value in col_name is assigned a number that measures its relative weight, inferred from how many times that value occurs in the column.
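To make that concrete, dividing probs by its sum gives each row's chance of being picked on the first draw (a quick check, not in the original output; the values follow from the counts above):
>>> (probs / probs.sum()).round(3)
0 0.103
1 0.103
2 0.138
3 0.069
4 0.138
5 0.138
6 0.103
7 0.069
8 0.138
Name: B, dtype: float64
An individual other row (weight 4) is twice as likely as a view row (weight 2) to come out first.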
# sampling:
>>> sample_df = df.sample(n=sample, weights=probs, random_state=1284)
>>> sample_df
B C D
6 this 0.119209 -1.039575
3 view -1.135632 1.071804
2 other -1.509059 -0.494929
5 other -0.173215 -0.706771
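Separately, if the goal is a share from each group exactly proportional to its size (rather than weighted random draws over the whole frame), a stratified variant of the question's groupby approach is an option; a minimal sketch, assuming pandas >= 1.1, where DataFrameGroupBy.sample accepts frac:
# sample the same fraction from every group, so each group contributes
# in proportion to its size (about 18000 rows in total)
frac = 18000 / len(df)
sample_df = df.groupby('prefix').sample(frac=frac, random_state=1)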
Upvotes: 1