Reputation: 187
I have a pandas DataFrame in which certain people are over-represented. I would like to subsample it, capping the number of observations per person at some maximum.
Right now I'm doing this in a loop and trying to build a DataFrame out of dicts, but the index is getting in the way, and I'm hoping there's an easier solution that someone can point me to. The real data has ~20K rows, ~4K columns, and ~400 people. Thanks.
Example Data.
import pandas as pd

df = pd.DataFrame({'name': ["Alice", "Alice", "Charles", "Charles", "Charles", "Kumar", "Kumar", "Kumar", "Kumar"],
                   'height': [124, 125, 169, 178, 177, 172, 173, 175, 174]})
df
height name
0 124 Alice
1 125 Alice
2 169 Charles
3 178 Charles
4 177 Charles
5 172 Kumar
6 173 Kumar
7 175 Kumar
8 174 Kumar
My current code, which for this example tries to cap everyone at 2 rows each.
sub_df = []
for name in pd.unique(df.name):
    sub_df.append(df[df.name == name].sample(n=2, random_state=42).to_dict())
pd.DataFrame(sub_df)
What I'm getting.
height name
0 {1: 125, 0: 124} {1: 'Alice', 0: 'Alice'}
1 {2: 169, 3: 178} {2: 'Charles', 3: 'Charles'}
2 {6: 174, 8: 175} {6: 'Kumar', 8: 'Kumar'}
What I want.
height name
0 125 Alice
1 124 Alice
2 169 Charles
3 178 Charles
4 174 Kumar
5 175 Kumar
Upvotes: 3
Views: 471
Reputation: 33793
Perform a groupby on 'name', then use sample:
# groupby and sample
df = df.groupby('name').apply(lambda grp: grp.sample(n=2))
# formatting
df = df.reset_index(drop=True)
The resulting output:
height name
0 125 Alice
1 124 Alice
2 177 Charles
3 169 Charles
4 175 Kumar
5 173 Kumar
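One caveat: grp.sample(n=2) raises a ValueError for any group with fewer than 2 rows (unless replace=True), so it only works as a cap when every person has at least that many rows. A minimal sketch of a variant that tolerates small groups is to shuffle the whole frame once and then keep the first rows of each group with head. The helper name cap_per_group and its parameters are my own, not part of pandas, and the example data is a variation of the question's data with only one row for Alice:

```python
import pandas as pd


def cap_per_group(df, key, max_rows, random_state=None):
    """Return at most `max_rows` randomly chosen rows per value of `key`.

    Hypothetical helper for illustration; not a pandas function.
    """
    # Shuffle all rows once, so the first rows of each group are a random pick.
    shuffled = df.sample(frac=1, random_state=random_state)
    # head() keeps up to max_rows rows per group and does not fail on small groups.
    capped = shuffled.groupby(key).head(max_rows)
    # Restore the original row order, then renumber the index.
    return capped.sort_index().reset_index(drop=True)


# Variation of the example data: Alice has only one row here.
df = pd.DataFrame({
    'name': ["Alice", "Charles", "Charles", "Charles",
             "Kumar", "Kumar", "Kumar", "Kumar"],
    'height': [124, 169, 178, 177, 172, 173, 175, 174],
})

capped = cap_per_group(df, 'name', max_rows=2, random_state=42)
print(capped.groupby('name').size().to_dict())
# {'Alice': 1, 'Charles': 2, 'Kumar': 2}
```

Which rows are kept for Charles and Kumar depends on random_state. Drop the sort_index() step if the original row order does not matter.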
Upvotes: 3