How to sample a pandas dataframe selecting X rows from group 1 but Y rows from group 2

Question

Imagine a Students/Grades dataframe, with one row per student and that student's letter grade (A, B, or C).

Using pandas, how can I create multiple groups such that each group has 1 student with an A, 2 students with Bs, and 1 student with C?

I've tried using pandas' groupby('Grade') and then sampling from each grade group. The problem is that this gives me the same number of students from each grade group, whereas I'd like a specific number of students from each one.

The solution doesn't need to care about the leftovers. As long as each fully formed group follows the required distribution, I'd be happy.

Thanks for any help,

Seshadri

Accepted Answer

You can do that by using a dictionary to store the number of samples from each group, as shown below:

import pandas as pd
import numpy as np

# create an example dataframe of 30 students with random grades
df = pd.DataFrame({
    'Student': ['Person' + str(i + 1) for i in range(30)],
    'Grade': np.random.choice(['A', 'B', 'C'], 30, replace=True),
})

# use a dict to store how many students to sample from each grade
# (1 A, 2 Bs and 1 C, as asked in the question)
sample_freq = {'A': 1, 'B': 2, 'C': 1}

# group by desired variable
groups = df.groupby('Grade')

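# sample the requested number of rows from each group and concatenate
# them into a single dataframe; sample() raises a ValueError if a grade
# has fewer rows than requested
pd.concat(
    [group_df.sample(sample_freq[grade]) for grade, group_df in groups])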
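The code above produces a single group. The question asks for multiple groups, so here is a minimal sketch of one way to extend the same dictionary idea. It assumes each student may appear in only one group and that every count in the dictionary is positive. The helper name make_groups, its parameters, and the added Group column are hypothetical, not part of pandas. Each grade is shuffled once and then sliced into consecutive chunks, so no row is reused, and leftovers are simply dropped.

```python
import numpy as np
import pandas as pd


def make_groups(df, sample_freq, grade_col="Grade", seed=None):
    """Return a list of dataframes, one per complete group (hypothetical helper).

    Assumes every count in sample_freq is a positive integer.
    """
    # Shuffle the rows of each grade once, so the slices taken below are random.
    shuffled = {
        grade: df[df[grade_col] == grade].sample(frac=1, random_state=seed)
        for grade in sample_freq
    }
    # The scarcest grade limits how many complete groups can be formed.
    n_groups = min(len(shuffled[g]) // k for g, k in sample_freq.items())

    groups = []
    for i in range(n_groups):
        # Take the i-th chunk of k rows from each grade; chunks never overlap.
        parts = [shuffled[g].iloc[i * k:(i + 1) * k]
                 for g, k in sample_freq.items()]
        groups.append(pd.concat(parts).assign(Group=i + 1))
    return groups


# Example data: 30 students with random grades (seeded for repeatability).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Student": [f"Person{i + 1}" for i in range(30)],
    "Grade": rng.choice(["A", "B", "C"], size=30),
})

groups = make_groups(df, {"A": 1, "B": 2, "C": 1}, seed=0)
for group in groups:
    print(group, end="\n\n")
```

If a single dataframe is more convenient, pd.concat(groups) stacks the groups, and the Group column records which group each student belongs to.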
