dc95

Reputation: 1359

Role of random_state in train_test_split and classifiers

Based on this answer: Random state (Pseudo-random number) in Scikit learn, if I use the same integer (say 42) as random_state, then each train-test split should give the same result (i.e., the same data instances in train on each run, and likewise for test).

But,

  1. for test_size in test_sizes:
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=42)
        clf = SVC(C=penalty, probability=False)
    

    Suppose I have code like this, where I change test_size on each loop iteration. How will that affect what random_state does? Will it reshuffle everything, or keep as many rows in place as possible and move a few rows from train to test (or vice versa) according to the test size?

  2. Also, random_state is a parameter for some classifiers, such as sklearn.svm.SVC and sklearn.tree.DecisionTreeClassifier. I have code like this:

    clf = tree.DecisionTreeClassifier(random_state=0)
    scores = cross_validate(clf, X_train, y_train, cv=cv)
    cross_val_test_score = round(scores['test_score'].mean(), prec)
    clf.fit(X_train, y_train)
    

    What exactly does random_state do here? It is used while defining the classifier, before any data has been supplied. I found the following at http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html:

If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.

  1. Suppose the following line is executed multiple times, once for each of several test sizes:

    clf = tree.DecisionTreeClassifier(random_state=0)
    

    If I set random_state=int(test_size*100), does that mean the results will come out the same for each test size on every run (and different across test sizes)?

    (Here, tree.DecisionTreeClassifier could be replaced with another classifier that also accepts random_state, such as sklearn.svm.SVC. I assume all classifiers use random_state in a similar way?)

Upvotes: 3

Views: 7677

Answers (2)

Acccumulation

Reputation: 3591

You can check this with the code:

import pandas as pd
from sklearn.model_selection import train_test_split

test_series = pd.Series(range(100))
size30split = train_test_split(test_series, random_state=42, test_size=0.3)
size25split = train_test_split(test_series, random_state=42, test_size=0.25)
common = [element for element in size25split[0] if element in size30split[0]]
print(len(common))

This prints 70: every row of the 30-split's 70-row training set is also in the 25-split's 75-row training set, so shrinking the test size simply moved elements from the test set into the training set.

train_test_split creates a random permutation of the rows and takes the test set from the first n rows of that permutation, where n is determined by the test size. With a fixed random_state, the permutation is the same on every call, so changing the test size only moves the boundary within it.
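This prefix behavior can be sketched as follows (my illustration, not part of the original answer; it assumes the current permutation-based implementation): with the same random_state, the smaller test set should be contained in the larger one, and the larger training set in the smaller.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

test_series = pd.Series(range(100))

# Same seed, different test sizes: both splits start from the same permutation.
train30, test30 = train_test_split(test_series, random_state=42, test_size=0.3)
train25, test25 = train_test_split(test_series, random_state=42, test_size=0.25)

# The 25-element test set is drawn from the front of the same permutation,
# so it is a subset of the 30-element test set ...
assert set(test25).issubset(set(test30))
# ... and, equivalently, the 70-row training set sits inside the 75-row one,
# which is why the count of common training rows above is 70.
assert set(train30).issubset(set(train25))
```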

What does random_state do here?

When the DecisionTreeClassifier object named clf is created, its random_state attribute is set to 0; if you type print(clf.random_state), the value 0 will be printed. The attribute is not consumed at construction time. When you later call methods of clf that involve randomness, such as clf.fit, those methods seed their random number generator from it.
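A small sketch of this (my example, on made-up data): the seed stored at construction is only used when fit runs, and two classifiers built with the same random_state and fitted on the same data learn identical trees.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, for illustration only.
rng = np.random.RandomState(0)
X = rng.rand(100, 5)
y = (X[:, 0] > 0.5).astype(int)

# random_state is simply stored at construction ...
a = DecisionTreeClassifier(random_state=0)
assert a.random_state == 0

# ... and consumed when fit() is called.
a.fit(X, y)
b = DecisionTreeClassifier(random_state=0).fit(X, y)

# Same seed, same data -> identical fitted tree structure.
assert np.array_equal(a.tree_.feature, b.tree_.feature)
assert np.array_equal(a.tree_.threshold, b.tree_.threshold)
```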

Upvotes: 1

Chris Farr

Reputation: 3759

1: Since you are changing the test size, the random state won't keep the selected rows identical across different test sizes, and that wouldn't necessarily be desirable anyway, since you are simply trying to score the model at various sample sizes. What the fixed random state does give you is comparability: for a given test size, the split is identical from one full run of the script to the next, so you can properly compare model performance on exactly the same samples.

2: For models such as decision tree classifiers and many others, some decisions during training are made at random. Setting random_state ensures those decisions come out exactly the same from one run to the next, creating reproducible behavior.

3: If the test size is different and you multiply it by 100, you will create a different random state for each test size, but from one full run to the next the behavior will still be reproducible. You could just as easily use a single static value there.

Not all models use random_state in the same way, because each has different things it decides at random: a RandomForest selects random subsets of features and bootstrap samples, a neural network initializes random weights, and so on.
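As a rough illustration of the RandomForest case (my sketch, with synthetic data): refitting with the same random_state reproduces the bootstrap samples and feature subsets exactly, so the fitted forests are identical, while a different seed typically yields a different forest.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, for illustration only.
rng = np.random.RandomState(1)
X = rng.rand(200, 8)
y = (X[:, 0] + X[:, 1] > 1).astype(int)

f1 = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
f2 = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
f3 = RandomForestClassifier(n_estimators=10, random_state=1).fit(X, y)

# Same seed -> identical bootstrap samples and feature choices,
# hence identical feature importances.
assert np.array_equal(f1.feature_importances_, f2.feature_importances_)

# A different seed usually changes them (typical, though not guaranteed).
print(np.array_equal(f1.feature_importances_, f3.feature_importances_))
```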

Upvotes: 2
