Pandas groupby sampling - ignore case where sample is greater than number of elements

Question

I can sample values of column a within each group of b as follows:

import pandas as pd

df = pd.DataFrame({'a': [10,20,30,40,50,60,70],
                   'b': [1,1,1,0,0,0,0]})

df.groupby('b', as_index=False)['a'].apply(lambda x: x.sample(n=3))

This gives three randomly chosen values of a for each group of b.

However, with replace=False, n can be at most the number of elements in the smallest group.

Is there a clean way to sample n elements up to the maximum number of items in a group?

For example, in the DataFrame above, only three rows have b equal to 1.

If I ran df.groupby('b').apply(lambda x: x.sample(n=4)) (note n=4), this would raise an error, because that group has only three rows.

What's a clean way to sample up to the maximum for each group?
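To make the failure concrete, here is a minimal reproduction. It assumes the error surfaces as a ValueError, which is what I see when sample is asked for more items than a group holds with replace=False.

```python
import pandas as pd

df = pd.DataFrame({'a': [10, 20, 30, 40, 50, 60, 70],
                   'b': [1, 1, 1, 0, 0, 0, 0]})

try:
    # The group b == 1 has only three rows, so n=4 cannot be satisfied
    df.groupby('b')['a'].apply(lambda x: x.sample(n=4))
    raised = False
except ValueError as err:
    raised = True
    print(err)
```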

Chris · Accepted Answer

Wrapping n in min is one option:

df = pd.DataFrame({'a': [10,20,30,40,50,60,70],
                   'b': [1,1,1,0,0,0,0]})

n = 4
# Cap the sample size at the group size so small groups do not raise
df.groupby('b', as_index=False)['a'].apply(lambda x: x.sample(n=min(n, len(x))))

Output:

0  3    40
   4    50
   6    70
   5    60
1  2    30
   1    20
   0    10
Name: a, dtype: int64
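If you need this in more than one place, the min trick can be pulled into a small function. The sketch below is only an illustration: sample_up_to is a hypothetical helper name, not part of pandas, and random_state is passed only to make the run repeatable.

```python
import pandas as pd

df = pd.DataFrame({'a': [10, 20, 30, 40, 50, 60, 70],
                   'b': [1, 1, 1, 0, 0, 0, 0]})


def sample_up_to(group, n, random_state=None):
    """Sample at most n items from one group (hypothetical helper)."""
    return group.sample(n=min(n, len(group)), random_state=random_state)


# group_keys=False keeps the original row labels as the index
sampled = df.groupby('b', group_keys=False)['a'].apply(
    sample_up_to, n=4, random_state=0
)

# Count how many values were drawn from each group of b
counts = df.loc[sampled.index, 'b'].value_counts().to_dict()
print(counts)
```

With n=4, the group with four rows contributes four values and the group with three rows contributes all three.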

Or, if you always want every element of each group (i.e. a random shuffle within each group), use frac=1:

df.groupby('b', as_index=False)['a'].apply(lambda x: x.sample(frac=1))

Output:

0  6    70
   4    50
   5    60
   3    40
1  2    30
   1    20
   0    10
Name: a, dtype: int64
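A related idiom, which is a different technique from the apply-based approach above, is to shuffle the whole frame once and then take the first n rows of each group. Since head returns fewer rows when a group is smaller than n, no explicit min is needed. This is a sketch, with random_state added only for repeatability.

```python
import pandas as pd

df = pd.DataFrame({'a': [10, 20, 30, 40, 50, 60, 70],
                   'b': [1, 1, 1, 0, 0, 0, 0]})

n = 4
# Shuffle all rows, then keep the first n rows seen for each value of b
shuffled = df.sample(frac=1, random_state=0)
capped = shuffled.groupby('b').head(n)

print(capped.groupby('b').size().to_dict())
```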

Note that from pandas 1.1.0 onward, you can call sample directly on the groupby object.
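A short sketch of that direct call, assuming pandas 1.1.0 or later. As far as I can tell, n is not clipped to the group size here either, so the example uses frac=1 for the shuffle and n=3, which every group can satisfy.

```python
import pandas as pd

df = pd.DataFrame({'a': [10, 20, 30, 40, 50, 60, 70],
                   'b': [1, 1, 1, 0, 0, 0, 0]})

# Shuffle the rows within each group, no apply needed
shuffled = df.groupby('b').sample(frac=1, random_state=0)

# Draw exactly three values of a from each group
three_each = df.groupby('b')['a'].sample(n=3, random_state=0)

print(shuffled)
print(three_each)
```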
