user3243499

Reputation: 3151

How to do train test split such that there are enough training and testing data from each class in Python?

I have a dataset with 5 classes, distributed as follows:

[Image: class distribution of the dataset]

As the distribution shows, there are very few samples for class 1.

How do I do a train-test split of this data in Python so that there is enough training and testing data from each class?

Upvotes: 1

Views: 1522

Answers (2)

ResidentSleeper

Reputation: 2495

Set the stratify parameter of train_test_split to your target column.

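stratify makes the split preserve the class proportions of the target, so the train and test sets each get roughly the same share of every class as the full dataset. See the scikit-learn documentation for train_test_split.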

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y)
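To check what stratification does, here is a minimal sketch on made-up data. The label counts, the placeholder features, and random_state=0 are assumptions for illustration, not values from the question. It counts how many samples of each class end up in each part of the split.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: class 1 is rare (20 of 1000 samples).
y = [0] * 300 + [1] * 20 + [2] * 250 + [3] * 230 + [4] * 200
X = [[i] for i in range(len(y))]  # placeholder features, one per sample

# stratify=y asks for the same class proportions in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

train_counts = Counter(y_train)
test_counts = Counter(y_test)

# Print per-class counts: label, number in train, number in test.
for label in sorted(set(y)):
    print(label, train_counts[label], test_counts[label])
```

With these counts, every class should show up in both parts at about a 70/30 ratio. Note that train_test_split raises an error when stratifying if any class has fewer than 2 samples.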

Upvotes: 2

Rubel Hassan

Reputation: 186

The train_test_split function shuffles the dataset before splitting by default, unless you set the shuffle parameter to False. I think shuffling makes it likely that the training portion of the dataset will have values from all the categories, but it does not guarantee it, and a very rare class can still be missing from one of the parts. Additionally, if you want the outcome of train_test_split to be deterministic, you can use the random_state parameter. Please see the documentation to know more. Hope it helps.
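A minimal sketch, again on made-up labels (the class sizes and seeds are assumptions), that checks two things: a fixed random_state reproduces the same split, and a shuffled split without stratify can leave a rare class out of the test set.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical labels: class 1 has only 5 of 500 samples.
y = [0] * 495 + [1] * 5
X = [[i] for i in range(len(y))]  # placeholder features

# The same random_state gives the same split on every call.
y_test_a = train_test_split(X, y, test_size=0.2, random_state=42)[3]
y_test_b = train_test_split(X, y, test_size=0.2, random_state=42)[3]
print(y_test_a == y_test_b)

# Over 100 different seeds, count the shuffled (non-stratified) splits
# whose test set contains no sample of class 1 at all.
missing = sum(
    1 not in train_test_split(X, y, test_size=0.2, random_state=seed)[3]
    for seed in range(100)
)
print(missing, "of 100 splits have no class 1 sample in the test set")
```

If the count of such splits is not acceptable for your data, passing stratify=y is the usual way to keep the class proportions fixed across both parts.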

Upvotes: -1
