Reputation: 1359
Based on this answer: Random state (Pseudo-random number) in Scikit learn, if I use the same integer (say 42) as random_state, then each time it does a train-test split, it should give the same split (i.e. the same data instances in train during each run, and the same for test).
But,
for test_size in test_sizes:
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=42)
    clf = SVC(C=penalty, probability=False)
Suppose I have code like this. In this case, I am changing the test_size in each loop iteration. How will that affect what random_state does? Will it shuffle everything, OR keep as many rows intact as possible and shift a few rows from train to test (or vice versa) according to the test size?
Also, random_state is a parameter for some classifiers like sklearn.svm.SVC and sklearn.tree.DecisionTreeClassifier. I have code like this:
clf = tree.DecisionTreeClassifier(random_state=0)
scores = cross_validate(clf, X_train, y_train, cv=cv)
cross_val_test_score = round(scores['test_score'].mean(), prec)
clf.fit(X_train, y_train)
What exactly does random_state do here, given that it is set when the classifier is defined, before the classifier has been supplied with any data? I got the following from http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html:
If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.
Suppose the following line is executed multiple times for each of multiple test-sizes:
clf = tree.DecisionTreeClassifier(random_state=0)
If I keep random_state=int(test_size*100), does that mean that for each test-size, the results will come out to be the same? (and for different test-sizes, they will be different?)
(Here, tree.DecisionTreeClassifier could be replaced with other classifiers that also use random_state, such as sklearn.svm.SVC. I assume all classifiers use random_state in a similar way?)
Upvotes: 3
Views: 7677
Reputation: 3591
You can check this with the code:
import pandas as pd
from sklearn.model_selection import train_test_split
test_series = pd.Series(range(100))
size30split = train_test_split(test_series, random_state=42, test_size=.3)
size25split = train_test_split(test_series, random_state=42, test_size=.25)
# `in` on a Series tests the index, so compare against the values explicitly
common = [element for element in size25split[0] if element in size30split[0].values]
print(len(common))
This gives an output of 70, indicating that it just moved elements from the test set to the training set.
train_test_split creates a random permutation of the rows and takes the first n rows of that permutation as the test set, where n is determined by the test size.
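The permutation-then-slice behaviour described above can be sketched in plain Python. This is only an illustration of the idea, not scikit-learn's actual implementation: sketch_split is a hypothetical helper, and it assumes that the first n shuffled rows go to the test set.

```python
import random

def sketch_split(rows, test_size, seed):
    """Hypothetical helper: shuffle once with a fixed seed, then slice."""
    order = list(rows)
    # The same seed always yields the same shuffled order.
    random.Random(seed).shuffle(order)
    n_test = round(len(order) * test_size)
    # Assumption: the first n_test shuffled rows become the test set.
    return order[n_test:], order[:n_test]

rows = range(100)
train30, test30 = sketch_split(rows, 0.30, seed=42)
train25, test25 = sketch_split(rows, 0.25, seed=42)

# The smaller test set is contained in the larger one, so changing
# test_size only moves rows across the train/test boundary.
print(set(test25) <= set(test30))
print(len(set(train25) & set(train30)))
```

Under these assumptions the two prints give True and 70, which mirrors the count of 70 common training elements reported above.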
What does random_state do here?
When the DecisionTreeClassifier object named clf is created, it's initialized with its random_state attribute set to 0. Note that if you type print(clf.random_state), the value 0 will be printed. When you call methods of clf, such as clf.fit, those methods may use the random_state attribute as a parameter.
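A small check of this, using the iris dataset bundled with scikit-learn as stand-in data (the question's own X_train and y_train are not shown, so that choice is an assumption): the constructor only stores the seed, the randomness is drawn when fit is called, and two fits with the same seed on the same data should yield the same tree.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Stand-in data; the original post does not show its dataset.
X, y = load_iris(return_X_y=True)

clf_a = DecisionTreeClassifier(random_state=0)
clf_b = DecisionTreeClassifier(random_state=0)

# Nothing random has happened yet: the constructor only stores the value.
print(clf_a.get_params()["random_state"])

# The seed is consumed here, when the tree is actually built.
clf_a.fit(X, y)
clf_b.fit(X, y)

# Same seed and same data give the same learned tree structure.
same_tree = export_text(clf_a) == export_text(clf_b)
print(same_tree)
```

Comparing the text dump of the two trees is just one convenient way to confirm they match; comparing predictions would work as well.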
Upvotes: 1
Reputation: 3759
1: Because you are changing the test size, a fixed random state will not give you identical train and test sets across test sizes, and that would not necessarily be desirable anyway, since you are simply trying to get scores for various sample sizes. What it does give you is the ability to compare models on input data split with the same random state: for a given test size, the test set will be exactly the same from one run to the next, so you can properly compare model performance on the same samples.
2: Models such as decision tree classifiers, and many others, have internal choices or parameters that are set at random. The random state here ensures that they are set exactly the same way from one run to the next, which makes the behavior reproducible.
3: If you set the random state to the test size multiplied by 100, each test size gets a different random state, but the behavior is still reproducible from one full run to the next. You could just as easily set a static value there.
Not all models use the random state in the same way, since each has different parameters that it sets at random. A random forest uses it to select random features, a neural network uses it to initialize random weights, and so on.
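One caveat on deriving the seed as int(test_size*100): floating-point rounding can make int() truncate to the integer below the intended one. For example, int(0.29 * 100) evaluates to 28 rather than 29, so two different test sizes may end up sharing a seed. A short sketch with illustrative test sizes (not taken from the question) shows the difference; using round() is one possible way to avoid it.

```python
# Illustrative test sizes, not taken from the original post.
test_sizes = [0.28, 0.29, 0.30]

for test_size in test_sizes:
    truncated = int(test_size * 100)   # may drop to the integer below
    rounded = round(test_size * 100)   # nearest integer
    print(test_size, truncated, rounded)

# 0.29 * 100 is stored as a value slightly below 29.
seed_truncated = int(0.29 * 100)
seed_rounded = round(0.29 * 100)
print(seed_truncated, seed_rounded)
```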
Upvotes: 2