Reputation: 3151
I have a dataset with 5 classes, distributed as follows:
As the distribution shows, there are very few samples for class 1.
How do I do a train-test split of this data in Python so that there are enough training and testing samples from each category?
Upvotes: 1
Views: 1522
Reputation: 2495
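Set the stratify parameter of train_test_split to your target column. stratify ensures that every class is split in the same proportion, so the train and test sets both keep the class distribution of the full dataset. See the scikit-learn documentation for details.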
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y)
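To see what stratify does on imbalanced labels, here is a minimal sketch. The class counts and the placeholder feature matrix are made up for illustration (the question's real distribution is not shown), and random_state=0 is only there to make the run repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 5 classes, class 1 is rare.
counts = {0: 400, 1: 20, 2: 300, 3: 180, 4: 100}
y = np.concatenate([np.full(n, label) for label, n in counts.items()])
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Per-class sample counts in each part of the split.
train_counts = np.bincount(y_train, minlength=5)
test_counts = np.bincount(y_test, minlength=5)

# Fraction of each class that ended up in the test set (close to 0.3 for all).
test_fraction = test_counts / (train_counts + test_counts)
print("train:", train_counts)
print("test: ", test_counts)
print("test fraction per class:", test_fraction)
```

Note that stratify needs at least 2 samples in every class; otherwise train_test_split raises a ValueError.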
Upvotes: 2
Reputation: 186
By default, train_test_split shuffles the dataset before splitting, unless you set the shuffle parameter to False. With shuffle=True it is likely, though not guaranteed, that the training portion contains samples from all the categories; a very rare class can still end up entirely in one part of the split. Additionally, if you want the outcome of train_test_split to be deterministic, you can use the random_state parameter. Please see the documentation to learn more. Hope it helps.
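A small sketch of the two parameters mentioned above, using made-up labels sorted by class (an assumption chosen to make the effect of shuffle=False obvious):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels sorted by class; class 1 is rare.
y = np.array([0] * 50 + [1] * 3 + [2] * 47)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features

# The same random_state gives an identical split on every call.
a = train_test_split(X, y, test_size=0.3, random_state=42)
b = train_test_split(X, y, test_size=0.3, random_state=42)
same_split = all(np.array_equal(u, v) for u, v in zip(a, b))
print("identical splits:", same_split)

# shuffle=False keeps the original order: the test set is simply the tail
# of the data, which here contains only class 2.
_, _, y_train_ns, y_test_ns = train_test_split(X, y, test_size=0.3, shuffle=False)
print("classes in unshuffled test set:", np.unique(y_test_ns))
```

Shuffling avoids the ordering problem, but only stratify=y fixes the per-class proportions in the train and test sets.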
Upvotes: -1