Sklearn: Take only few records from each target class

Question

I have a large dataset for a multi-class classification problem (3 classes). I want to take a sub-sample of the data, i.e. 200 records belonging to each class, and then split that sub-sample.

Say the 3 classes are cat, dog, and cow. I want to apply the split to a subset containing 200 records per class, selected from the large dataset, to train the ML model.

cat - 200 observations
dog - 200 observations
cow - 200 observations

This is the code line for splitting the data:

from sklearn.model_selection import train_test_split

# split data: 70% train / 30% test, preserving class proportions
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.3,
                                                    random_state=42)

How can I select X and y so that they contain 200 records for each class?

imdevskp · Accepted Answer

Take an equal number of samples from each class (here, the first 200 rows of each):

df_cats = df[df['class'] == 'cat'][:200]
df_dogs = df[df['class'] == 'dog'][:200]
df_cows = df[df['class'] == 'cow'][:200]

Then concatenate the dataframes:

df_new = pd.concat([df_cats, df_dogs, df_cows])
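Note that the slices above take the first 200 rows of each class in file order. If the data happens to be sorted (by date, source, etc.), a random draw per class may be preferable. A minimal sketch of that variant, assuming pandas 1.1 or newer (where `groupby(...).sample` is available) and using a made-up toy dataframe in place of the real data:

```python
import pandas as pd

# Hypothetical toy data standing in for the large dataset. The 'class'
# column name follows the answer above; 'feature' is made up.
df = pd.DataFrame({
    "feature": range(900),
    "class": ["cat", "dog", "cow"] * 300,
})

# Draw 200 random rows from each class; random_state makes the draw repeatable.
df_new = df.groupby("class").sample(n=200, random_state=42)

# Check the per-class counts of the sub-sample.
counts = df_new["class"].value_counts()
print(counts)
```

With the default `replace=False`, this raises an error if any class has fewer than 200 rows, so it only applies when every class is large enough.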

From there, split it into X and y.
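That last step could look roughly like the following. This is only a sketch: the dataframe here is a hypothetical stand-in for `df_new`, the feature columns `f1` and `f2` are invented, and the label column is assumed to be named `class` as in the answer.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for df_new: 200 rows per class, two made-up features.
df_new = pd.DataFrame({
    "f1": range(600),
    "f2": [i % 7 for i in range(600)],
    "class": ["cat"] * 200 + ["dog"] * 200 + ["cow"] * 200,
})

# Features are every column except the label; the label is the 'class' column.
X = df_new.drop(columns="class")
y = df_new["class"]

# Same split as in the question: 30% test, stratified on the label.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=42
)
```

Because the sub-sample is balanced and the split is stratified, each class should appear in the same proportion in the train and test sets.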
