blue-sky

Reputation: 53806

What is random_state parameter in scikit-learn TSNE?

According to http://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html random_state is

random_state : int or RandomState instance or None (default) Pseudo Random Number generator seed control. If None, use the numpy.random singleton. Note that different initializations might result in different local minima of the cost function.

What state is being seeded? How does this affect the t-SNE implementation? This parameter is not mentioned in the t-SNE paper: http://jmlr.csail.mit.edu/papers/volume9/vandermaaten08a/vandermaaten08a.pdf

Update 1 :

While Classification results depend on random_state? certainly helps to explain why a random state is used in sklearn, it does not clarify how the random state is used in sklearn's TSNE implementation.

Upvotes: 3

Views: 12362

Answers (2)

rriccilopes

Reputation: 387

It's used for the PCA step (to reduce the data dimensionality) and to initialize the embedding of the training data.

You can check the code by yourself https://github.com/scikit-learn/scikit-learn/blob/14031f6/sklearn/manifold/t_sne.py#L777

You can also try to read more about the method.
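As a minimal sketch of where the parameter plugs in (the toy data here is purely illustrative), random_state seeds both the PCA-based initialization and the random placement of the low-dimensional embedding:

```python
import numpy as np
from sklearn.manifold import TSNE

# Toy data: 100 points in 10 dimensions (illustrative values only).
rng = np.random.RandomState(0)
X = rng.rand(100, 10)

# random_state seeds the optional PCA step (init='pca') and the
# random initialization of the low-dimensional embedding.
tsne = TSNE(n_components=2, init='pca', random_state=42)
embedding = tsne.fit_transform(X)
print(embedding.shape)  # (100, 2)
```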

Edit 1: It may (or may not) directly impact the results. I suggest you set a random seed and always use it.

Upvotes: 3

Vivek Kumar

Reputation: 36599

The use of random_state is explained pretty well in the post I commented on. As for this specific case of TSNE, random_state seeds the random initialization from which the optimization of the cost function starts.

As documented:

method : string (default: ‘barnes_hut’)

By default the gradient calculation algorithm uses Barnes-Hut approximation running in O(NlogN) time

Also, search the term "random" in the paper you cited. The first line is

The gradient descent is initialized by sampling map points randomly from an isotropic Gaussian with small variance that is centered around the origin.

Other occurrences of the word "random" also clarify that there is randomness in choosing the starting map points, which can therefore affect which local minimum of the cost function is reached.

This randomness comes from a pseudorandom number generator, which is seeded by the random_state parameter.
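The initialization quoted above from the paper can be sketched in a few lines of numpy. This is not sklearn's actual code, just an illustration of the idea (the 1e-4 variance matches the reference implementation's convention, and init_embedding is a hypothetical helper name):

```python
import numpy as np

def init_embedding(n_samples, n_components=2, seed=None):
    # Sample initial map points from an isotropic Gaussian with small
    # variance, centered at the origin -- as the paper describes.
    rng = np.random.RandomState(seed)  # this is what random_state seeds
    return 1e-4 * rng.randn(n_samples, n_components)

# Same seed -> identical starting points -> reproducible optimization.
a = init_embedding(100, seed=42)
b = init_embedding(100, seed=42)
print(np.allclose(a, b))  # True
```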

Explanation: Some algorithms use random numbers for things like initializing certain parameters (such as weights before optimizing), splitting data randomly into train and test sets, choosing a subset of features, etc.

Now, in programming and software in general, nothing is inherently truly random. To generate random numbers, a program is used. But since it is a program with fixed steps, it cannot be truly random, so such programs are called pseudorandom generators. To output a different sequence of numbers each time, they take an input from which the numbers are generated. Typically, this input is the current time in milliseconds (UTC epoch). This input is called the seed. Fixing the seed fixes the output numbers.
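A quick numpy demonstration of this determinism, with illustrative seed values:

```python
import numpy as np

# Two pseudorandom generators constructed with the same seed
# produce identical sequences of numbers.
rng1 = np.random.RandomState(7)
rng2 = np.random.RandomState(7)
same = np.array_equal(rng1.rand(5), rng2.rand(5))
print(same)  # True

# An unseeded generator is seeded from OS entropy / the clock instead,
# so it gives a different sequence on each run of the program.
rng3 = np.random.RandomState()
```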

random_state is used as the seed for the pseudorandom number generator in scikit-learn, to duplicate behavior when such randomness is involved in an algorithm. With a fixed random_state, the program produces exactly the same results across different runs, so it is easier to debug and identify problems, if any. Without setting random_state, a different seed is used each time the algorithm runs and you will get different results. It may happen that you get very high scores the first time and are never able to achieve them again.

Now, in machine learning, we want to replicate our steps exactly as performed before, in order to analyse the results. Hence random_state is fixed to some integer. Hope it helps.
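Putting it together for TSNE specifically, here is a small sketch (toy data, illustrative seed values) showing that a fixed random_state makes separate fits reproducible on the same machine:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.RandomState(0)
X = rng.rand(50, 5)  # toy data for illustration

# Two independent fits with the same random_state start from the same
# random initialization, so they converge to the same embedding.
e1 = TSNE(n_components=2, random_state=3).fit_transform(X)
e2 = TSNE(n_components=2, random_state=3).fit_transform(X)
print(np.allclose(e1, e2))  # True
```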

Upvotes: 4
