Vikas

Reputation: 61

random_state parameter in sklearn's train_test_split

What difference do different values of random_state make to the output? For instance, if I set it to 0 and then to 100, what difference would it make to the output?

Upvotes: 1

Views: 3987

Answers (2)

Brad Solomon

Reputation: 40878

Passing an integer to random_state seeds NumPy's pseudo-random number generator with that value and makes the resulting "random" train and test data reproducible. This means that if you pass the function an array a with random_state=0, that seed value of 0 will always result in the same train and test data.

When you pass an integer, the value eventually gets passed to sklearn.utils.check_random_state(), which for an integer reduces to:

if isinstance(seed, (numbers.Integral, np.integer)):
    return np.random.RandomState(seed)

This in turn is used by a class like ShuffleSplit to call a random permutation:

rng = check_random_state(self.random_state)
for i in range(self.n_splits):
    # random partition
    permutation = rng.permutation(n_samples)
    ind_test = permutation[:n_test]
    ind_train = permutation[n_test:(n_test + n_train)]
    yield ind_train, ind_test

Here's an example using the actual method that is used:

>>> import numpy as np
>>> np.random.RandomState(0).permutation([1, 4, 9, 12, 15])
array([ 9,  1,  4, 12, 15])
>>> np.random.RandomState(0).permutation([1, 4, 9, 12, 15])
array([ 9,  1,  4, 12, 15])
>>> np.random.RandomState(0).permutation([1, 4, 9, 12, 15])
array([ 9,  1,  4, 12, 15])
>>> np.random.RandomState(100).permutation([1, 4, 9, 12, 15])
array([ 4,  9, 12, 15,  1])
>>> np.random.RandomState(100).permutation([1, 4, 9, 12, 15])
array([ 4,  9, 12, 15,  1])
>>> np.random.RandomState(100).permutation([1, 4, 9, 12, 15])
array([ 4,  9, 12, 15,  1])
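
To tie the pieces above together, here is a minimal sketch of how a seeded permutation turns into a train/test split. Note that split_indices is a hypothetical helper written for illustration, not sklearn's API; it only mirrors the permutation-and-slice logic from the ShuffleSplit excerpt, using NumPy alone.

```python
import numpy as np


def split_indices(n_samples, n_test, seed):
    """Hypothetical helper: shuffle indices with a seeded RandomState, then slice."""
    rng = np.random.RandomState(seed)
    permutation = rng.permutation(n_samples)  # shuffled copy of 0..n_samples-1
    ind_test = permutation[:n_test]           # first n_test indices form the test set
    ind_train = permutation[n_test:]          # the remainder forms the training set
    return ind_train, ind_test


train_a, test_a = split_indices(10, 3, seed=0)
train_b, test_b = split_indices(10, 3, seed=0)
train_c, test_c = split_indices(10, 3, seed=100)

# Same seed: the test indices match element for element.
print(np.array_equal(test_a, test_b))
# Different seeds: print both to compare the shuffles.
print(test_a, test_c)
```

The seed decides the order of the shuffle, and the split is just a slice of that order, which is why the same seed gives the same split every time.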

Upvotes: 0

Tim

Reputation: 10709

From the docs:

The random_state is the seed used by the random number generator.

In general a seed is used to create reproducible outputs. In the case of train_test_split the random_state determines how your data set is split. Unless you want to create reproducible runs, you can skip this parameter.

For instance, if I set 0 and if I set 100 what difference would it make to the output?

You will always get the same train/test split for a specific seed. Different seeds will result in a different train/test split.

Upvotes: 3
