Michael P.

Reputation: 1

dataset train/test split code understanding

from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]

I am currently reading the book Hands-On ML and I am having some issues with this code. I don't have much experience with Python, which might be part of the problem, but let me make my confusion clearer. In the book, the housing problem requires us to create strata so the dataset has sufficient instances of each income category; that is done with code I didn't copy here. The code I am showing is used to create the train and test sets using those income categories. The first and second lines of code are clear; the third is where I get lost. We create a split of 0.2 test / 0.8 train, but what exactly happens from then on? What is the for loop used for?

I have looked at a couple of pages for info but haven't really found anything that made the situation clear, so I would really appreciate the help.

Thank you in advance for your answers.

Upvotes: 0

Views: 87

Answers (2)

Vítor Cézar

Reputation: 269

StratifiedShuffleSplit is most useful when you are doing K-fold-style cross-validation, where you split the data into training and test sets in K different ways and then average the results over the K iterations.

n_splits must equal the K value, and in your case K is one, which makes no sense for cross-validation. I think you'd be better off using sklearn.model_selection.train_test_split, which makes more sense here.
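For a single stratified 80/20 split like the one in the question, a rough sketch of the train_test_split equivalent (assuming the housing DataFrame from the book already has its income_cat column) could look like this:

from sklearn.model_selection import train_test_split

# stratify= keeps the income_cat proportions the same in both sets,
# which is what the StratifiedShuffleSplit code in the question also does
strat_train_set, strat_test_set = train_test_split(
    housing,
    test_size=0.2,
    stratify=housing["income_cat"],
    random_state=42,
)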

Upvotes: 0

Todd Burus

Reputation: 983

That for loop is just taking the row indices produced for the split and using them to select those rows of the original data, which form the training and test sets.
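In other words, split.split(housing, housing["income_cat"]) is a generator that yields one (train_index, test_index) pair per split; with n_splits=1 it yields exactly one pair, so the loop body runs once. A small self-contained sketch with a made-up DataFrame (not the book's data) shows what those indices are:

import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Toy stand-in for the book's housing DataFrame (made-up values)
housing = pd.DataFrame({
    "median_income": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "income_cat":    [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
})

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    # train_index and test_index are NumPy arrays of row positions;
    # with test_size=0.2 the test set gets 2 of the 10 rows, one from
    # each income_cat value, so the category proportions are preserved
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]

# .loc works here because the index is the default 0..n-1 RangeIndex;
# with a non-default index you would use .iloc instead.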

Upvotes: 1
