Reputation: 595
I have the problem that, after splitting my data into training and test sets, one class is completely missing from my test set.
Example of a 60/40 split:
Training: 'Orange', 0, 0, 0, 'Orange'
Test data: 0, 0, 0, 0, 0
Obviously the label 'Orange' is not included in the test set. How can I ensure that the split puts at least some samples of every class into the test set as well as into the training set? I thought the stratify parameter would do this, but unfortunately it does not.
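A minimal sketch of what I am doing (the arrays are just placeholders for my real data):
import numpy as np
from sklearn.model_selection import train_test_split

# toy data: 2 'Orange' samples and 8 zero samples
X = np.arange(20).reshape(10, 2)                                   # placeholder features
y = np.array(['Orange', 0, 0, 0, 'Orange', 0, 0, 0, 0, 0], dtype=object)

# plain 60/40 split -- with so few 'Orange' samples the test set
# can easily end up without any of them
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)
print(y_test)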
Upvotes: 3
Views: 2738
Reputation: 6260
As you are working with an imbalanced dataset, I would highly recommend that you do not manually adjust the class balance, but run cross-validation instead: https://scikit-learn.org/stable/modules/cross_validation.html
This will give you more stable parameters and a better result. The idea is that you run over different folds, the train and test data change from fold to fold, and your parameters are adjusted accordingly.
A small example:
from sklearn import svm
from sklearn.model_selection import cross_val_score

clf = svm.SVC(kernel='linear', C=1)        # your classifier
scores = cross_val_score(clf, X, y, cv=5)  # assuming your features are X and your target is y
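For a classifier, cross_val_score already uses stratified folds by default, so every fold keeps the class proportions of y. If you want to make that explicit, a minimal sketch (still assuming your data is in X and y):
from sklearn.model_selection import StratifiedKFold

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)  # keeps class proportions in every fold
scores = cross_val_score(clf, X, y, cv=cv)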
Upvotes: 0
Reputation: 33147
Use train_test_split with the stratify input argument:
import numpy as np
from sklearn.model_selection import train_test_split
X = np.arange(100).reshape((25, 4))
y = [0, 1, 2, 3, 4] * 5
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42, stratify=y)
print(y_train)
print(y_test)
[0, 4, 1, 4, 3, 2, 1, 1, 0, 4, 0, 2, 4, 3, 1, 2, 3]
[1, 4, 3, 2, 0, 0, 2, 3]
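Every class shows up in both lists. Applied to data shaped like the one in the question (hypothetical X2/y2 arrays with two 'Orange' labels), stratify=y guarantees the same thing; just note that every class needs at least two samples, otherwise train_test_split raises a ValueError:
X2 = np.arange(20).reshape((10, 2))
y2 = ['Orange', 0, 0, 0, 'Orange', 0, 0, 0, 0, 0]
X2_train, X2_test, y2_train, y2_test = train_test_split(
    X2, y2, test_size=0.4, random_state=42, stratify=y2)
print(y2_test)   # 'Orange' now appears in both the training and the test labels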
Upvotes: 0
Reputation: 764
1. Use the code below to split your data into train and test sets - this uses the stratify option of train_test_split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, stratify=y)
2. Or you could use StratifiedKFold - this applies stratified k-fold cross-validation (see the sketch below)
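A minimal sketch of option 2, assuming X and y are numpy arrays and clf is your classifier:
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y):
    X_train, X_test = X[train_idx], X[test_idx]   # every fold keeps the class proportions of y
    y_train, y_test = y[train_idx], y[test_idx]
    clf.fit(X_train, y_train)
    print(clf.score(X_test, y_test))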
Upvotes: 1
Reputation: 101
You could split the data into two groups according to classes:
Group1: 'Orange', 'Orange'
Group2: 0,0,0,0,0,0,0,0
Do the split within each group, and then put the pieces back together, like this (assuming X holds the corresponding features):
mylist = ['Orange', 0, 0, 0, 'Orange', ...]          # all labels
oranges = [i for i, label in enumerate(mylist) if label == 'Orange']
zeros = [i for i, label in enumerate(mylist) if label == 0]
orange_data = [X[i] for i in oranges]
orange_labels = [mylist[i] for i in oranges]
orange_data_train, orange_data_test, orange_label_train, orange_label_test = train_test_split(orange_data, orange_labels)
Then do the same for the zeros and put the pieces together like:
training_data = orange_data_train + zero_data_train
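A runnable sketch of the whole idea, assuming X is a numpy feature array and y a numpy label array (the helper name per_class_split is just for illustration):
import numpy as np
from sklearn.model_selection import train_test_split

def per_class_split(X, y, test_size=0.25, random_state=0):
    # split every class separately, then stack the pieces back together
    X_tr, X_te, y_tr, y_te = [], [], [], []
    for cls in np.unique(y):
        mask = (y == cls)
        a_tr, a_te, b_tr, b_te = train_test_split(
            X[mask], y[mask], test_size=test_size, random_state=random_state)
        X_tr.append(a_tr); X_te.append(a_te)
        y_tr.append(b_tr); y_te.append(b_te)
    return (np.concatenate(X_tr), np.concatenate(X_te),
            np.concatenate(y_tr), np.concatenate(y_te))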
Be aware that many classification algorithms work best if the classes have similar sample size, but that's another topic.
Upvotes: 0