Reputation: 267
I have a large dataset for a multi-class classification problem (3 classes). I want to take a sub-sample of the data, i.e. 200 records belonging to each class, and then split that sub-sample into train and test sets.
Say the 3 classes are cat, dog and cow. I want to apply the split to a subset in which 200 records have been selected from the large dataset for each of the classes cat, dog and cow, and use it to train the ML model.
This is the line of code that splits the data:
from sklearn.model_selection import train_test_split

# split data
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.3, random_state=42)
How can I select X and y such that they contain 200 records for each class?
Upvotes: 0
Views: 821
Reputation: 18647
You can groupby the "class" column, then you have a couple of options:
If you want to shuffle and select a random 200 of each, use the groupby sample method (available in pandas 1.1 and later).
df.groupby('class').sample(200, random_state=42)
If shuffling isn't necessary and you just need the first 200 of each, use the groupby head method.
df.groupby('class').head(200)
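To tie this back to the question, here is a minimal end-to-end sketch: sample 200 rows per class, separate features from the target, then run the stratified split. The toy DataFrame and the column names feature_1 and feature_2 are made up for illustration; only the "class" column name comes from the answer above. Note that sample(n=200) raises an error if any class has fewer than 200 rows, so this assumes every class is large enough.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 1000 rows per class, two numeric features.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "feature_1": rng.normal(size=3000),
    "feature_2": rng.normal(size=3000),
    "class": np.repeat(["cat", "dog", "cow"], 1000),
})

# Draw 200 random rows from each class (requires pandas >= 1.1).
subset = df.groupby("class").sample(n=200, random_state=42)

# Separate the feature columns from the target column.
X = subset.drop(columns="class")
y = subset["class"]

# Stratified 70/30 split, same call as in the question.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=42
)

print(y_train.value_counts())
print(y_test.value_counts())
```

Because the subset is already balanced, stratify=y keeps the classes balanced in both the train and test parts.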
Upvotes: 3
Reputation: 2223
Take an equal number of samples from each class:
df_cats = df[df['class'] == 'cat'][:200]
df_dogs = df[df['class'] == 'dog'][:200]
df_cows = df[df['class'] == 'cow'][:200]
Then concatenate the dataframes:
df_new = pd.concat([df_cats, df_dogs, df_cows])
From there, split df_new into X (the feature columns) and y (the class column).
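As a rough sketch of that last step, the same slicing can be written as a loop over the class labels, followed by the X / y separation. The toy DataFrame and its weight column are hypothetical; the "class" column and the three labels are taken from the question.

```python
import pandas as pd

# Hypothetical toy data: 300 rows per class and one feature column.
df = pd.DataFrame({
    "weight": range(900),
    "class": ["cat", "dog", "cow"] * 300,
})

# Take the first 200 rows of each class, then stack the pieces.
labels = ["cat", "dog", "cow"]
parts = [df[df["class"] == label].head(200) for label in labels]
df_new = pd.concat(parts)

# Features are every column except the target; the target is "class".
X = df_new.drop(columns="class")
y = df_new["class"]

print(y.value_counts())
```

X and y can then be passed to train_test_split exactly as in the question. Since this takes the first 200 rows rather than a random 200, it is only representative if the row order carries no pattern.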
Upvotes: 1