Ian

Reputation: 3898

Pandas groupby sampling - ignore case where sample is greater than number of elements

I can sample a within each group of b as follows.

import pandas as pd

df = pd.DataFrame({'a': [10,20,30,40,50,60,70],
                   'b': [1,1,1,0,0,0,0]})

df.groupby('b', as_index=False)['a'].apply(lambda x: x.sample(n=3))

Gives:

b       a
0  3    40
   4    50
   5    60
1  0    10
   2    30
   1    20

However, when sampling n elements without replacement (replace=False), n can be at most the size of the smallest group.

Is there a clean way to sample n elements up to the maximum number of items in a group?

For example, in the given DataFrame, only three rows have b equal to 1.

If I ran df.groupby('b').apply(lambda x: x.sample(n=4)) (note n=4), this would raise a ValueError, because the group with b equal to 1 has only three rows.
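To make the failure concrete, here is a minimal sketch that reproduces it on a single group. It calls sample on the three-row group directly rather than through groupby; the variable names small_group and raised are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'a': [10, 20, 30, 40, 50, 60, 70],
                   'b': [1, 1, 1, 0, 0, 0, 0]})

# The group with b == 1 holds only three rows.
small_group = df.loc[df['b'] == 1, 'a']

try:
    # Asking for four rows without replacement exceeds the group size.
    small_group.sample(n=4)
    raised = False
except ValueError as exc:
    raised = True
    print(exc)
```

The same error surfaces when the lambda is applied per group, since each group is sampled independently.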

What's a clean way to sample up to the maximum for each group?

Upvotes: 0

Views: 431

Answers (2)

Chris

Reputation: 29742

Wrapping n with min is one option:

df = pd.DataFrame({'a': [10,20,30,40,50,60,70],
                   'b': [1,1,1,0,0,0,0]})

n = 4
# Cap the sample size at the group size so small groups do not raise.
df.groupby('b', as_index=False)['a'].apply(lambda x: x.sample(n=min(n, len(x))))

Output:

0  3    40
   4    50
   6    70
   5    60
1  2    30
   1    20
   0    10
Name: a, dtype: int64

Or if you always want to sample the max (i.e. random shuffle), use frac:

df.groupby('b', as_index=False)['a'].apply(lambda x: x.sample(frac=1))

Output:

0  6    70
   4    50
   5    60
   3    40
1  2    30
   1    20
   0    10
Name: a, dtype: int64

Note that since pandas 1.1.0, you can call sample directly on the groupby object.
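As a rough sketch of how the groupby-level sample could be combined with a cap, one possibility (not part of the original answer) is to shuffle each group with frac=1 and then keep the first n rows per group with head. This assumes pandas 1.1.0 or later; the random_state value and the names shuffled and capped are arbitrary choices for the example.

```python
import pandas as pd

df = pd.DataFrame({'a': [10, 20, 30, 40, 50, 60, 70],
                   'b': [1, 1, 1, 0, 0, 0, 0]})

n = 4  # desired upper bound on rows per group

# Shuffle the rows within each group (GroupBy.sample needs pandas >= 1.1.0).
shuffled = df.groupby('b').sample(frac=1, random_state=0)

# Keep at most n rows from each shuffled group; smaller groups are kept whole.
capped = shuffled.groupby('b').head(n)

print(capped.groupby('b').size())
```

Because head never asks for more rows than a group has, this avoids the error without a lambda, at the cost of shuffling every group in full first.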

Upvotes: 2

Ian

Reputation: 3898

You can adapt the sample size per group by comparing a pre-specified maximum sample size with the size of the group, like so:

max_sample = 4
df.groupby('b')['a'].apply(lambda x: x.sample(n=max_sample if len(x)>max_sample else len(x)))
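If this is needed in several places, the same idea could be wrapped in a small helper. The sketch below is only an illustration: the function name sample_up_to and its parameters are hypothetical, and it iterates over the groups and concatenates the results instead of using apply.

```python
import pandas as pd


def sample_up_to(frame, by, n, random_state=None):
    """Sample at most n rows per group without replacement (hypothetical helper)."""
    parts = [
        # min() caps the request at the group size, so small groups are kept whole.
        group.sample(n=min(n, len(group)), random_state=random_state)
        for _, group in frame.groupby(by)
    ]
    return pd.concat(parts)


df = pd.DataFrame({'a': [10, 20, 30, 40, 50, 60, 70],
                   'b': [1, 1, 1, 0, 0, 0, 0]})

out = sample_up_to(df, 'b', 4, random_state=0)
print(out)
```

Here the group with b equal to 0 contributes four rows and the group with b equal to 1 contributes all three of its rows.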

Upvotes: 0
