Understanding set.seed() in R

Question

I am using R and I have some issues with replicating the output of the following:

mod1 <- glm(TVAR ~ .,data = df, family = "binomial")
y <- predict.glm(mod1)

due to its dependency on set.seed().

I have some questions related to this:

1. I am aware that if I first call set.seed(123) (or any other seed), the random number generation will always start from the same state, hence I will achieve a replicable result. Nevertheless, let's say that I want to reverse engineer the seed by starting from a good result and then retrieving the seed to replicate that good result the next time. In other words, let's assume that I run the same code n times without setting a seed beforehand, with the intention of finding the result that suits me best and then retrieving the seed that was used. Would that be possible? It may sound like a sort of cheating, but it is not: I am just trying to pin down the seed dependency of the code's results, under the assumption that the overall idea behind the code is sound and only needs to become replicable.
2. Just for my understanding: a new seed is only used when I delete all the variables in the environment? In fact, if I run the same code more than once but without cleaning the environment, the results are the same, hence the same seed was used. I would appreciate some clarity on this.
3. Lastly: is there a way to understand when a function is dependent on set.seed()? For instance, in the CRAN manual I could not find any indication of this, which seems to be a crucial issue.

Konrad Rudolph · Accepted Answer

To answer your points in turn:

Would [reverse engineering the seed from the result of a computation] be possible?

It depends on the actual random-number generator being used, but in general this is hard, because the state space of a good RNG is huge and you might have to search it exhaustively. Potentially this wouldn’t just take hours but years.
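To make the cost concrete, here is a minimal sketch of such an exhaustive search. It assumes something the question does not offer: that the result came from a set.seed() call with an integer known to lie in a small range (1 to 10000 here). The helper find_seed and the seed 4242 are made up for the illustration. Without that assumption the candidate space is far too large for this approach.

```r
# Hypothetical scenario: a result was produced after set.seed() with an
# unknown integer that is known to lie between 1 and 10000.
set.seed(4242)
observed <- rnorm(3)

# Try each candidate seed in turn and return the first one that
# reproduces the observed values exactly.
find_seed <- function(target, candidates) {
  for (s in candidates) {
    set.seed(s)
    if (identical(rnorm(length(target)), target)) {
      return(s)
    }
  }
  NA_integer_  # no candidate matched
}

recovered <- find_seed(observed, 1:10000)
recovered
```

The loop has to replay the whole computation for every candidate, so the running time grows with both the size of the candidate range and the cost of the computation being replayed.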

a new seed is only used when I delete all the variables in the environment?

A new seed is used whenever you invoke set.seed. The actual current seed value is stored in the hidden variable .Random.seed in the global environment. However, removing the seed won’t make your last computation reproducible, since R re-initialises the value of that seed based on a non-deterministic value (in actual fact, the current time and the process ID).
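Because the state lives in .Random.seed, a practical alternative to recovering a seed after the fact is to record the state before each run and restore it later. The following is a sketch of that idea, assuming R's default generator settings; the variable names are illustrative only.

```r
# In a fresh session .Random.seed does not exist until the RNG is first
# used, so draw one number to force its creation if necessary.
if (!exists(".Random.seed", envir = globalenv(), inherits = FALSE)) {
  runif(1)
}

# Record the RNG state, then run the stochastic computation.
saved_state <- get(".Random.seed", envir = globalenv())
first_run <- rnorm(5)

# Later: restore the recorded state and replay the computation.
assign(".Random.seed", saved_state, envir = globalenv())
second_run <- rnorm(5)

identical(first_run, second_run)  # TRUE
```

Saving saved_state alongside each result (for example with saveRDS) would let you pick the run you like and replay exactly that one.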

if I run the same code more than once but without cleaning the environment, the results are the same, hence the same seed was used.

No: consuming random values (by calling a stochastic function) changes the random seed. So running multiple computations in a row without cleaning the environment does not produce the same result. In fact, it would be terrible if it did. You can see this easily yourself:

> rnorm(1)
[1] -0.3156453
> rnorm(1)
[1] 0.7345465

… clearly, two consecutive calls of a stochastic function (here, rnorm) did not produce the same result, even though I didn’t clean the environment in between calls.
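The same behaviour can be checked programmatically. This short sketch (the seed 123 is arbitrary) shows that the state advances between calls and that only an explicit set.seed call rewinds it:

```r
set.seed(123)
first  <- rnorm(1)
second <- rnorm(1)   # the RNG state has advanced, so this draw differs

set.seed(123)        # rewind to the same starting state
replay <- rnorm(1)   # reproduces the first draw

identical(first, second)  # FALSE
identical(first, replay)  # TRUE
```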

is there a way to understand when a function is dependent on set.seed()?

You could set different random seeds, rerun the function, and see if the output changes.

Apart from that there is no general, straightforward way to do this. If the function does not document its dependency on set.seed, then your only recourse is to look carefully at the source code of the function (and all functions it calls in turn).
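One way to automate the empirical check is to compare .Random.seed before and after calling the function. The helper below, uses_rng, is a hypothetical name and not part of any package. It is only a heuristic: it will miss functions that restore the seed before returning, or that draw random numbers from a source other than R's own generator.

```r
# Returns TRUE if calling f(...) changed R's global RNG state.
uses_rng <- function(f, ...) {
  # Make sure the state exists (it is absent in a fresh session).
  if (!exists(".Random.seed", envir = globalenv(), inherits = FALSE)) {
    runif(1)
  }
  before <- get(".Random.seed", envir = globalenv())
  f(...)
  after <- get(".Random.seed", envir = globalenv())
  !identical(before, after)
}

uses_rng(rnorm, 1)     # TRUE: rnorm consumes random numbers
uses_rng(mean, 1:10)   # FALSE: mean is deterministic
```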

Bonus (as noted by Roland in the comments):

glm and predict.glm are not stochastic functions, and do not depend on set.seed.
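This is easy to confirm by fitting the same model under two different seeds. The sketch below uses the built-in mtcars data and the formula am ~ wt as stand-ins, since the question's df and TVAR are not available.

```r
# Fit the same logistic regression under two different seeds.
set.seed(1)
fit_a  <- glm(am ~ wt, data = mtcars, family = "binomial")
pred_a <- predict(fit_a)

set.seed(2)
fit_b  <- glm(am ~ wt, data = mtcars, family = "binomial")
pred_b <- predict(fit_b)

identical(coef(fit_a), coef(fit_b))  # TRUE
identical(pred_a, pred_b)            # TRUE
```

If the results in the question do vary between runs, the variation most likely comes from a step outside these two calls, such as how df is constructed.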
