Reputation: 27
Can I use stratified sampling to split the data into an 80% train / 20% test split in Python?
I have looked into this, but it is for k-fold stratified sampling. I'm not sure whether just putting 0 as the number of iterations would work, because it is implemented in the cross-validation package, which assumes at least 2 folds!
StratifiedShuffleSplit(labels=[0 0 1 1], n_iter=3, ...)
Upvotes: 0
Views: 560
Reputation: 3138
I'm not 100% sure what exactly your question is, so let's just review the details of sklearn.cross_validation.StratifiedShuffleSplit().
This cross-validation object is a merge of StratifiedKFold and ShuffleSplit.
This means that the function returns randomized, stratified folds. The number of folds you get back is determined by the n_iter
parameter. If you set this to 0, the iterator will simply not yield anything.
It's also possible that not all folds will be unique.
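For example (a minimal sketch using the same, now-deprecated sklearn.cross_validation API as the example further down; the exact indices depend on random_state), with n_iter=3 you get three independent randomized splits, and the same index can appear in more than one test set:

import numpy as np
from sklearn.cross_validation import StratifiedShuffleSplit

# 10 samples, 5 per class
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

# Three randomized, stratified 80/20 splits
sss = StratifiedShuffleSplit(y, n_iter=3, test_size=0.2, random_state=0)
for train_index, test_index in sss:
    print("TEST:", test_index)  # test indices can repeat across iterations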
To answer what I think is your question: "Can I use stratified sampling to split the data into an 80% train / 20% test split in Python?"
Yes. Let's look at the example code below. By setting the test_size parameter to 0.2 (20%), you force each fold to use 80% of the data for training and 20% for testing.
import numpy as np
from sklearn.cross_validation import StratifiedShuffleSplit

X = np.array([[1, 1], [2, 2], [3, 3], [4, 4], [5, 5],
              [6, 6], [7, 7], [8, 8], [9, 9], [10, 10]])
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

# One randomized, stratified split: 80% train, 20% test
sss = StratifiedShuffleSplit(y, n_iter=1, test_size=0.2, random_state=0)
for train_index, test_index in sss:
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
Output:
TRAIN: [0 6 3 9 2 5 1 7] TEST: [4 8]
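As a side note, in scikit-learn 0.18+ the cross_validation module was deprecated in favour of model_selection, where the data is passed to split() instead of the constructor. A rough equivalent sketch (assuming a newer version is installed):

import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit, train_test_split

X = np.array([[1, 1], [2, 2], [3, 3], [4, 4], [5, 5],
              [6, 6], [7, 7], [8, 8], [9, 9], [10, 10]])
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

# Same idea: one randomized, stratified 80/20 split
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
for train_index, test_index in sss.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

# If you only need a single stratified split, train_test_split with
# stratify=y does the same thing in one call
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)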
Please let me know if this is what you were looking for, or if you have any other questions.
Upvotes: 1