EchoCache

Reputation: 595

Make data balanced after train test split operation (scikit)?

I have the problem that, after splitting my data into training and test data, one class is totally missing from my test set.

Example with a 60/40 split:
Training: 'Orange', 0, 0, 0, 'Orange'
Test data: 0, 0, 0, 0, 0

Obviously the word "Orange" is not included in the test set. How can I ensure that the split puts at least some samples of every class into the test set as well as into the training set? I thought the stratify parameter would do this, but unfortunately it does not.
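A simplified sketch of the setup (class names and sizes are illustrative):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(10).reshape(-1, 1)             # 10 samples, 1 feature
y = np.array(['Orange'] * 2 + ['zero'] * 8)  # 'Orange' is the rare class

# 60/40 split without stratification: both 'Orange' samples can land in
# the training set, leaving the test set without any.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)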

Upvotes: 3

Views: 2738

Answers (4)

PV8
PV8

Reputation: 6260

Since you are working with an imbalanced dataset, I would highly recommend not adjusting the class balance manually, but running cross-validation instead: https://scikit-learn.org/stable/modules/cross_validation.html


This will give you more stable parameters and a better result. The idea is that you run over different folds, the train and test data change each time, and your parameters are adjusted accordingly.

A small example:

from sklearn import svm
from sklearn.model_selection import cross_val_score

clf = svm.SVC(kernel='linear', C=1)        # your classifier
scores = cross_val_score(clf, X, y, cv=5)  # assuming your features are X, and target is y
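Note that for a classifier, passing an integer cv to cross_val_score already uses stratified folds under the hood. If you want to make that explicit, a minimal sketch (still assuming X and y as above):

from sklearn import svm
from sklearn.model_selection import StratifiedKFold, cross_val_score

clf = svm.SVC(kernel='linear', C=1)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=skf)  # stratified folds, made explicit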

Upvotes: 0

seralouk

Reputation: 33147

Use train_test_split with the stratify input argument:

import numpy as np
from sklearn.model_selection import train_test_split
X = np.arange(100).reshape((25, 4))
y = [0, 1, 2, 3, 4] * 5

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

print(y_train)
print(y_test)

[0, 4, 1, 4, 3, 2, 1, 1, 0, 4, 0, 2, 4, 3, 1, 2, 3]
[1, 4, 3, 2, 0, 0, 2, 3]

Upvotes: 0

Infinite

Reputation: 764

1. Use the snippet below to split your train/test data; it uses the stratify option of train_test_split:

   from sklearn.model_selection import train_test_split 
   train, test = train_test_split(X, test_size=0.25, stratify=y) 

2. Or you could try StratifiedKFold, which applies stratified k-fold cross-validation; see the sketch below.
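   A minimal sketch of option 2, assuming NumPy arrays X and y:

   from sklearn.model_selection import StratifiedKFold

   skf = StratifiedKFold(n_splits=5)
   for train_idx, test_idx in skf.split(X, y):
       X_train, X_test = X[train_idx], X[test_idx]
       y_train, y_test = y[train_idx], y[test_idx]
       # fit and evaluate your model on each fold here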

Upvotes: 1

haydard

Reputation: 101

You could split the data into two groups according to classes:

Group1: 'Orange', 'Orange'
Group2: 0,0,0,0,0,0,0,0

Do the split within each group, and put them back together like this.

mylist = ['Orange', 0, 0, 0, 'Orange', ...]
oranges = [item for item in mylist if item == 'Orange']
zeros = [item for item in mylist if item == 0]
orange_data = [o.X for o in oranges]    # features of the 'Orange' samples
orange_labels = [o.y for o in oranges]  # labels of the 'Orange' samples
orange_data_train, orange_data_test, orange_label_train, orange_label_test = train_test_split(orange_data, orange_labels)

Then do the same for the zeros and put the pieces together like:

training_data = orange_data_train + zero_data_train
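A runnable sketch of the same idea with plain NumPy arrays (names and sizes here are illustrative):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)             # 10 samples, 2 features
y = np.array(['Orange'] * 2 + ['zero'] * 8)  # one label per sample

# Split each class separately...
Xo_train, Xo_test, yo_train, yo_test = train_test_split(
    X[y == 'Orange'], y[y == 'Orange'], test_size=0.5)
Xz_train, Xz_test, yz_train, yz_test = train_test_split(
    X[y == 'zero'], y[y == 'zero'], test_size=0.25)

# ...then put the pieces back together.
X_train = np.concatenate([Xo_train, Xz_train])
y_train = np.concatenate([yo_train, yz_train])
X_test = np.concatenate([Xo_test, Xz_test])
y_test = np.concatenate([yo_test, yz_test])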

Be aware that many classification algorithms work best if the classes have similar sample sizes, but that's another topic.

Upvotes: 0
