johnjohn

Reputation: 892

Random sample by group: how to specify n, not weight? (using DataFrameGroupBy.sample)

This is a follow-up to an earlier question (other contributors asked me to post it as a new question).

We have this mock df:

import pandas as pd

df = pd.DataFrame({
        'id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        'country': ['USA', 'USA', 'USA', 'USA', 'USA', 'Canada', 'Canada', 'Canada', 'USA', 'Canada']
})

Let's say I want to sample 4 random rows from USA and 2 random rows from Canada. I've tried:

df.groupby("country").sample(n=[4, 2])

This raises an error, probably because of the list in square brackets. How can I specify a different n for each group?

Ideally I need a solution that uses df.groupby.sample. I also need to specify n, not a proportion or weights as in the documentation (see here), and I need to set a seed. Thank you.

Upvotes: 1

Views: 275

Answers (1)

Shubham Sharma

Reputation: 71689

You can group the dataframe by country, sample each group separately with the number of rows looked up from a dictionary, and then concatenate the sampled groups with pd.concat:

d = {'USA': 4, 'Canada': 2} # mapping dict
pd.concat([g.sample(d[k]) for k, g in df.groupby('country', sort=False)])

   id country
0   1     USA
4   5     USA
1   2     USA
2   3     USA
6   7  Canada
9  10  Canada
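
The question also asks for a seed, which the snippet above does not set, so its output changes from run to run. A minimal sketch of one way to make it reproducible follows. It assumes a shared np.random.RandomState passed as random_state, so that each group draws from the same seeded stream. The helper name sample_by_group, the name n_per_group and the seed value 42 are illustrative choices, not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "country": ["USA", "USA", "USA", "USA", "USA",
                "Canada", "Canada", "Canada", "USA", "Canada"],
})

n_per_group = {"USA": 4, "Canada": 2}  # rows to draw from each country
SEED = 42  # arbitrary value, used only for illustration


def sample_by_group(frame, by, sizes, seed):
    """Draw sizes[key] rows from each group of `frame`, reproducibly.

    Hypothetical helper: a fresh RandomState is created on every call,
    so the same seed always yields the same rows.
    """
    rng = np.random.RandomState(seed)
    parts = [
        group.sample(n=sizes[key], random_state=rng)
        for key, group in frame.groupby(by, sort=False)
    ]
    return pd.concat(parts)


sampled = sample_by_group(df, "country", n_per_group, SEED)
print(sampled)
```

This still loops over the groups rather than calling df.groupby("country").sample directly, because the n argument of DataFrameGroupBy.sample is a single integer applied to every group.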

Upvotes: 2
