Sampling rows from Pandas DataFrame conditionally

Question

I have a pandas DataFrame in which certain people are over-represented. I would like to subsample it, capping the number of observations per person at some maximum.

Right now I'm doing this in a loop and trying to build a DataFrame out of dicts, but the index is getting in the way, and I'm hoping someone can point me to an easier solution. The real data has ~20K rows, ~4K columns, and ~400 people. Thanks.

Example Data.

import pandas as pd

df = pd.DataFrame({'name': ["Alice", "Alice", "Charles", "Charles", "Charles", "Kumar", "Kumar", "Kumar", "Kumar"],
                   'height': [124, 125, 169, 178, 177, 172, 173, 175, 174]})

df
   height     name
0     124    Alice
1     125    Alice
2     169  Charles
3     178  Charles
4     177  Charles
5     172    Kumar
6     173    Kumar
7     175    Kumar
8     174    Kumar

My current code, which for this example tries to cap everyone at 2 rows each:

sub_df = []
for name in pd.unique(df.name):
    sub_df.append(df[df.name == name].sample(n=2, random_state=42).to_dict())

pd.DataFrame(sub_df)

What I'm getting.

    height               name
0   {1: 125, 0: 124}    {1: 'Alice', 0: 'Alice'}
1   {2: 169, 3: 178}    {2: 'Charles', 3: 'Charles'}
2   {6: 174, 8: 175}    {6: 'Kumar', 8: 'Kumar'}

What I want.

   height     name
0     125    Alice
1     124    Alice
2     169  Charles
3     178  Charles
4     174    Kumar
5     175    Kumar

root · Accepted Answer

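Group by 'name', then draw a sample from each group. Note that sample(n=2) raises a ValueError for any group with fewer than 2 rows, so this assumes every person has at least 2 rows: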

# groupby and sample
df = df.groupby('name').apply(lambda grp: grp.sample(n=2))

# formatting
df = df.reset_index(drop=True)

The resulting output (the sampled rows vary between runs because no random_state is set):

   height     name
0     125    Alice
1     124    Alice
2     177  Charles
3     169  Charles
4     175    Kumar
5     173    Kumar
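
Since the question asks for a cap rather than an exact count, here is a minimal sketch of one alternative that also tolerates people with fewer rows than the cap. It swaps in a different technique from the answer above: shuffle the whole frame, then keep the first rows of each group with groupby(...).head(). The helper name cap_per_group, its parameters, and the extra "Zoe" row are hypothetical, added only to illustrate the small-group case.

```python
import pandas as pd


def cap_per_group(frame, key, cap, seed=None):
    """Return at most `cap` randomly chosen rows for each value of `key`.

    Hypothetical helper, not part of pandas.
    """
    # Shuffle every row; frac=1 keeps all rows, just in random order.
    shuffled = frame.sample(frac=1, random_state=seed)
    # head(cap) keeps the first `cap` rows of each group, or all of them
    # when the group is smaller than `cap`.
    capped = shuffled.groupby(key).head(cap)
    # Restore the original row order, then renumber the index from 0.
    return capped.sort_index().reset_index(drop=True)


# Example data from the question, plus one hypothetical single-row person.
df = pd.DataFrame({
    'name': ["Alice", "Alice", "Charles", "Charles", "Charles",
             "Kumar", "Kumar", "Kumar", "Kumar", "Zoe"],
    'height': [124, 125, 169, 178, 177, 172, 173, 175, 174, 160],
})

sub_df = cap_per_group(df, 'name', cap=2, seed=42)
print(sub_df)
```

With cap=2, each of Alice, Charles, and Kumar contributes two rows and Zoe keeps her single row. Because the rows are shuffled before head() is applied, each group's kept rows are a random subset; which rows are chosen depends on the seed.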
