Ranyk

Reputation: 267

Sklearn: Take only a few records from each target class

I have a large dataset for a multi-class classification problem (3 classes). I want to take a sub-sample of the data, i.e. 200 records belonging to each class, and then split that sub-sample into training and test sets.

Say the 3 classes are cat, dog and cow. I want to select 200 records per class from the large dataset and apply the split to that subset in order to train the ML model.

This is the code line for splitting the data:

from sklearn.model_selection import train_test_split

# split data
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.3,
                                                    random_state=42)

How can I select X and y so that they contain 200 records for each class?

Upvotes: 0

Views: 821

Answers (2)

Chris Adams

Reputation: 18647

You can group by the "class" column with groupby; then you have a couple of options:

  1. If you want a random 200 rows from each class, use the groupby sample method (available in pandas 1.1 and later).

    df.groupby('class').sample(200, random_state=42)
    
  2. If shuffling isn't necessary and you just need the first 200 rows of each class, use the groupby head method.

    df.groupby('class').head(200)
    
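A minimal end-to-end sketch of option 1, assuming a DataFrame `df` with a "class" column plus feature columns. The toy data and the column names `f1` and `f2` are made up for illustration. Note that `sample(n=200)` raises an error if a class has fewer than 200 rows, unless `replace=True` is passed.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data standing in for the large dataset (hypothetical columns f1, f2).
rng = np.random.default_rng(0)
n = 3000
df = pd.DataFrame({
    "f1": rng.normal(size=n),
    "f2": rng.normal(size=n),
    "class": rng.choice(["cat", "dog", "cow"], size=n),
})

# Draw 200 random rows from each class.
subset = df.groupby("class").sample(n=200, random_state=42)

# Separate features and target, then split as in the question.
X = subset.drop(columns="class")
y = subset["class"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=42
)

# Each class should show up exactly 200 times in the subset.
print(subset["class"].value_counts())
```

Swapping `sample(n=200, random_state=42)` for `head(200)` gives option 2 with the rest unchanged.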

Upvotes: 3

imdevskp

Reputation: 2223

Take an equal number of rows from each class (this takes the first 200 of each, without shuffling):

df_cats = df[df['class'] == 'cat'][:200]
df_dogs = df[df['class'] == 'dog'][:200]
df_cows = df[df['class'] == 'cow'][:200]

Then concatenate the dataframes:

df_new = pd.concat([df_cats, df_dogs, df_cows])

From there, split it into X and y.
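That last step could look roughly like this, assuming the target column is named "class" and every other column is a feature. The toy frame and the feature column `f1` are hypothetical, and the list comprehension is only a compact way of writing the three slices above.

```python
import pandas as pd

# Toy frame standing in for the real data (hypothetical feature column f1).
df = pd.DataFrame({
    "f1": range(900),
    "class": ["cat", "dog", "cow"] * 300,
})

# First 200 rows of each class, concatenated as in the answer.
parts = [df[df["class"] == label][:200] for label in ["cat", "dog", "cow"]]
df_new = pd.concat(parts)

# Features are every column except the target; the target is the class column.
X = df_new.drop(columns="class")
y = df_new["class"]
```

`X` and `y` can then be passed to `train_test_split` exactly as in the question.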

Upvotes: 1
