dc95

Reputation: 1359

Role of random_state in train_test_split and classifiers

Based on this answer: Random state (Pseudo-random number) in Scikit learn, if I use the same integer (say 42) as random_state, then each train-test split should give the same result (i.e., the same data instances in train on each run, and likewise for test).

But,

  1. for test_size in test_sizes:
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=42)
        clf = SVC(C=penalty, probability=False)
    

    Suppose I have code like this, where I change test_size on each loop iteration. How will that affect what random_state does? Will it reshuffle everything, or keep as many rows in place as possible and move a few rows from train to test (or vice versa) according to the test size?

  2. Also, random_state is a parameter for some classifiers, such as sklearn.svm.SVC and sklearn.tree.DecisionTreeClassifier. I have code like this:

    clf = tree.DecisionTreeClassifier(random_state=0)
    scores = cross_validate(clf, X_train, y_train, cv=cv)
    cross_val_test_score = round(scores['test_score'].mean(), prec)
    clf.fit(X_train, y_train)
    

    What exactly does random_state do here? It is used while defining the classifier, before any data has been supplied. I found the following at http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html:

If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.

  1. Suppose the following line is executed multiple times, once for each of several test sizes:

    clf = tree.DecisionTreeClassifier(random_state=0)
    

    If I set random_state=int(test_size*100), does that mean the results will come out the same for each test size on every run (and different across test sizes)?

    (Here, tree.DecisionTreeClassifier could be replaced with another classifier that also accepts random_state, such as sklearn.svm.SVC. I assume all classifiers use random_state in a similar way?)

Upvotes: 3

Views: 7677

Answers (2)

Acccumulation

Reputation: 3591

You can check this with the code:

import pandas as pd
from sklearn.model_selection import train_test_split

test_series = pd.Series(range(100))
size30split = train_test_split(test_series, random_state=42, test_size=0.3)
size25split = train_test_split(test_series, random_state=42, test_size=0.25)
common = [element for element in size25split[0] if element in size30split[0]]
print(len(common))

This prints 70: every row of the 30-split's 70-row training set is also in the 25-split's 75-row training set, so shrinking the test size simply moved elements from the test set into the training set.

train_test_split creates a random permutation of the rows and takes the test set from the first n rows of that permutation, where n is determined by the test size. With a fixed random_state, the permutation is the same on every call, so changing the test size only moves the boundary within it.
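This prefix behavior can be sketched as follows (my illustration, not part of the original answer; it assumes the current permutation-based implementation): with the same random_state, the smaller test set should be contained in the larger one, and the larger training set in the smaller.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

test_series = pd.Series(range(100))

# Same seed, different test sizes: both splits start from the same permutation.
train30, test30 = train_test_split(test_series, random_state=42, test_size=0.3)
train25, test25 = train_test_split(test_series, random_state=42, test_size=0.25)

# The 25-element test set is drawn from the front of the same permutation,
# so it is a subset of the 30-element test set ...
assert set(test25).issubset(set(test30))
# ... and, equivalently, the 70-row training set sits inside the 75-row one,
# which is why the count of common training rows above is 70.
assert set(train30).issubset(set(train25))
```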

What does random_state do here?

When the DecisionTreeClassifier object named clf is created, its random_state attribute is set to 0; if you type print(clf.random_state), the value 0 will be printed. The attribute is not consumed at construction time. When you later call methods of clf that involve randomness, such as clf.fit, those methods seed their random number generator from it.
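A small sketch of this (my example, on made-up data): the seed stored at construction is only used when fit runs, and two classifiers built with the same random_state and fitted on the same data learn identical trees.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, for illustration only.
rng = np.random.RandomState(0)
X = rng.rand(100, 5)
y = (X[:, 0] > 0.5).astype(int)

# random_state is simply stored at construction ...
a = DecisionTreeClassifier(random_state=0)
assert a.random_state == 0

# ... and consumed when fit() is called.
a.fit(X, y)
b = DecisionTreeClassifier(random_state=0).fit(X, y)

# Same seed, same data -> identical fitted tree structure.
assert np.array_equal(a.tree_.feature, b.tree_.feature)
assert np.array_equal(a.tree_.threshold, b.tree_.threshold)
```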

Upvotes: 1

Chris Farr

Reputation: 3759

1: Since you are changing the test size, the random state won't keep the selected rows identical across different test sizes, and that wouldn't necessarily be desirable anyway, since you are simply trying to score the model at various sample sizes. What the fixed random state does give you is comparability: for a given test size, the split is identical from one full run of the script to the next, so you can properly compare model performance on exactly the same samples.

2: For models such as decision tree classifiers and many others, some decisions during training are made at random. Setting random_state ensures those decisions come out exactly the same from one run to the next, creating reproducible behavior.

3: If the test size is different and you multiply it by 100, you will create a different random state for each test size, but from one full run to the next the behavior will still be reproducible. You could just as easily use a single static value there.

Not all models use random_state in the same way, because each has different things it decides at random: a RandomForest selects random subsets of features and bootstrap samples, a neural network initializes random weights, and so on.
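As a rough illustration of the RandomForest case (my sketch, with synthetic data): refitting with the same random_state reproduces the bootstrap samples and feature subsets exactly, so the fitted forests are identical, while a different seed typically yields a different forest.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, for illustration only.
rng = np.random.RandomState(1)
X = rng.rand(200, 8)
y = (X[:, 0] + X[:, 1] > 1).astype(int)

f1 = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
f2 = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
f3 = RandomForestClassifier(n_estimators=10, random_state=1).fit(X, y)

# Same seed -> identical bootstrap samples and feature choices,
# hence identical feature importances.
assert np.array_equal(f1.feature_importances_, f2.feature_importances_)

# A different seed usually changes them (typical, though not guaranteed).
print(np.array_equal(f1.feature_importances_, f3.feature_importances_))
```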

Upvotes: 2
