Reputation: 3898
I can sample a from each group of b as follows.
import pandas as pd
df = pd.DataFrame({'a': [10, 20, 30, 40, 50, 60, 70],
                   'b': [1, 1, 1, 0, 0, 0, 0]})
df.groupby('b', as_index=False)['a'].apply(lambda x: x.sample(n=3))
Gives:
b      a
0  3   40
   4   50
   5   60
1  0   10
   2   30
   1   20
However, if I want to sample n elements, then n must be at most the number of elements in each group (assuming replace=False).
Is there a clean way to sample n elements, capped at the number of items in each group?
For example, in the given DataFrame there are three rows where b has the value 1.
If I called df.groupby('b').apply(lambda x: x.sample(n=4)) (notice n=4), this would break.
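To make the failure concrete, here is a minimal sketch that reproduces it on a single group. The variable names group and failed are only for illustration; the Series stands in for the three values of a where b is 1.

```python
import pandas as pd

# Stand-in for the group b == 1, which holds only three values of a.
group = pd.Series([10, 20, 30])

try:
    # replace=False is the default, so asking for 4 of 3 items is rejected.
    group.sample(n=4)
    failed = False
except ValueError as exc:
    failed = True
    print(exc)
```

The same exception propagates out of groupby(...).apply(...) as soon as any one group is smaller than n.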
What's a clean way to sample up to the maximum for each group?
Upvotes: 0
Views: 431
Reputation: 29742
Wrapping the sample size with min would be an option:
df = pd.DataFrame({'a': [10, 20, 30, 40, 50, 60, 70],
                   'b': [1, 1, 1, 0, 0, 0, 0]})
n = 4
df.groupby('b', as_index=False)['a'].apply(lambda x: x.sample(n=min(n, len(x))))
Output:
0  3    40
   4    50
   6    70
   5    60
1  2    30
   1    20
   0    10
Name: a, dtype: int64
Or if you always want to sample the max (i.e. a random shuffle within each group), use frac:
df.groupby('b', as_index=False)['a'].apply(lambda x: x.sample(frac=1))
Output:
0  6    70
   4    50
   5    60
   3    40
1  2    30
   1    20
   0    10
Name: a, dtype: int64
Note that from pandas 1.1.0, you can call sample directly on the groupby object.
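As a rough sketch of how that could be used for the capped case (assuming pandas 1.1.0 or later): GroupBy.sample takes a single n or frac for all groups, so one option is to shuffle each group in full with frac=1 and then keep the first n rows per group with head. The names shuffled and capped, and the fixed random_state, are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'a': [10, 20, 30, 40, 50, 60, 70],
                   'b': [1, 1, 1, 0, 0, 0, 0]})
n = 4

# Shuffle every group in full; requires pandas >= 1.1.0.
# random_state is fixed here only to make the run repeatable.
shuffled = df.groupby('b').sample(frac=1, random_state=0)

# Keep at most n rows per group; head() simply returns fewer rows
# for groups smaller than n instead of raising.
capped = shuffled.groupby('b').head(n)
print(capped)
```

With the example data this keeps four rows for b == 0 and all three rows for b == 1.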
Upvotes: 2
Reputation: 3898
You can adapt the sample size per group by comparing a pre-specified maximum sample size with the size of the group, like so.
max_sample = 4
df.groupby('b')['a'].apply(lambda x: x.sample(n=max_sample if len(x)>max_sample else len(x)))
Upvotes: 0