Hands On Machine Learning California Housing problem

Question

I'm working through the book and came across this bit of scikit-learn code:

housing["income_cat"] = pd.cut(housing["median_income"], bins=[0., 1.5, 3.0, 4.5, 6., np.inf], labels=[1, 2, 3, 4, 5])

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)

for train_index, test_index in split.split(housing, housing["income_cat"]):
    stat_train_set = housing.loc[train_index]
    stat_test_set = housing.loc[test_index]

I get that the first line adds a column to the housing dataframe and assigns each row a bin from 1 to 5 that categorizes the income.

1: 0-<1.5
2: 1.5-<3.0
3: 3.0-<4.5
4: 4.5-<6
5: >6

I understand the second line returns a function to split.

What I don't understand is how the function knows which of the two indices gets the 20%. Is the second index always the one that the function applies the test_size parameter to?

jottbe · Accepted Answer

You only need to know that the split method produces an iterator, and this iterator yields a tuple of index arrays. The first element of this tuple holds the train indices, the second holds the test indices, so test_size always applies to the second one. There is no magic behind this.
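To see the order for yourself, here is a minimal sketch. It assumes a synthetic stand-in for the housing dataframe (1000 made-up income values, not the real dataset) and pulls the single tuple out of the iterator with next() instead of a for loop:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic stand-in for the housing data (values are made up for illustration).
rng = np.random.default_rng(0)
housing = pd.DataFrame({"median_income": rng.uniform(0.5, 10.0, size=1000)})
housing["income_cat"] = pd.cut(
    housing["median_income"],
    bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
    labels=[1, 2, 3, 4, 5],
)

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)

# split() returns a generator; each item is a (train_indices, test_indices) tuple.
train_index, test_index = next(split.split(housing, housing["income_cat"]))

print(len(train_index), len(test_index))  # 800 200
```

With n_splits=1 the for loop in the question runs exactly once, so unpacking a single next() call is equivalent here.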

If you would like to check the source code of the method, you can find it in the scikit-learn repository. Look in particular at the end of the method _iter_indices. There you will see the yield statement that produces this tuple.
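A side note on the bins listed in the question: by default pd.cut uses right=True, so each bin includes its right edge rather than its left one. For example, an income of exactly 1.5 lands in category 1, not 2. A small sketch with made-up income values:

```python
import numpy as np
import pandas as pd

# Made-up incomes chosen to sit on or just past the bin edges.
incomes = pd.Series([1.5, 1.6, 3.0, 6.0, 6.1])

cats = pd.cut(
    incomes,
    bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
    labels=[1, 2, 3, 4, 5],
)

# Bins are (0, 1.5], (1.5, 3.0], (3.0, 4.5], (4.5, 6.0], (6.0, inf)
print(cats.tolist())  # [1, 2, 2, 4, 5]
```

This does not change the answer above; it only affects which category a value sitting exactly on an edge falls into.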
