Reputation: 892
This question follows up on an earlier question (other contributors asked me to post it as a new question).
We have this mock df:
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'country': ['USA', 'USA', 'USA', 'USA', 'USA', 'Canada', 'Canada', 'Canada', 'USA', 'Canada']
})
Let's say I want to sample 4 random rows from USA and 2 random rows from Canada. I've tried:
df.groupby("country").sample(n=[4, 2])
This returns an error. The mistake is probably passing a list in square brackets. How can I specify a different n for each group?
Ideally I need a solution that uses df.groupby.sample. I also need to specify n, not a proportion or weights as in the documentation. Finally, I need to set a seed. Thank you.
Upvotes: 1
Views: 275
Reputation: 71689
You can group the dataframe on country, then .sample each group separately, where the number of samples to take is looked up in a dictionary, and finally .concat all the sampled groups:
d = {'USA': 4, 'Canada': 2} # mapping dict
pd.concat([g.sample(d[k]) for k, g in df.groupby('country', sort=False)])
   id country
0   1     USA
4   5     USA
1   2     USA
2   3     USA
6   7  Canada
9  10  Canada
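The question also asks for a seed, which the snippet above does not set. A minimal sketch of one way to add it, assuming that passing the same random_state to each per-group .sample call is acceptable (the seed value 42 and the names seed and sampled are arbitrary choices for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'country': ['USA', 'USA', 'USA', 'USA', 'USA',
                'Canada', 'Canada', 'Canada', 'USA', 'Canada'],
})

d = {'USA': 4, 'Canada': 2}  # number of rows to draw per country
seed = 42  # arbitrary value, chosen only for illustration

# Sample each group with its own n; random_state makes the draw repeatable.
sampled = pd.concat([
    g.sample(n=d[k], random_state=seed)
    for k, g in df.groupby('country', sort=False)
])
print(sampled)
```

Running this twice with the same seed should give the same rows. Note that d[k] raises a KeyError for any country missing from the dictionary, so every group needs an entry.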
Upvotes: 2