Reputation: 295
I am using R and I have some issues with replicating the output of the following:
mod1 <- glm(TVAR ~ ., data = df, family = "binomial")
y <- predict.glm(mod1)
due to its dependency on set.seed().
I have some questions related to this:
I am aware that if I first call set.seed(123) (or any other seed), random number generation always starts from the same state, so I get a replicable result. Nevertheless, suppose I want to reverse engineer the seed: start from a good result, then retrieve the seed so that I can replicate that result the next time. In other words, suppose I run the same code n times without setting a seed first, intending to pick the result that suits me best and then retrieve the seed that produced it. Would that be possible? It may sound like cheating, but it is not: I am just trying to pin down the code's dependency on the seed, under the assumption that the overall idea behind the code is sound and it only needs to be made replicable.
Just for my understanding: a new seed is only used when I delete all the variables in the environment? In fact, if I run the same code more than once but without cleaning the environment, the results are the same, hence the same seed was used. I would appreciate some clarity on this.
Lastly: is there a way to understand when a function is dependent on set.seed()? For instance, in the CRAN manual, I could not find any indication of this, which seems to be a crucial issue.
Upvotes: 0
Views: 6634
Reputation: 44867
@KonradRudolph gives a good answer here. I'd just like to add one point to it:
There are three ways to set the random seed, and they are not the same:
set.seed(n) sets it to an easily reproducible state. In general, set.seed() can only produce a tiny fraction of the possible values of the random seed, whereas calling the RNG should (eventually) cycle through all of them. There are about 2^20000 different random seeds possible, but set.seed() can only create about 2^32 of them. (Both of these numbers are over-estimates, but the ratio is about right.)
You can save and restore the .Random.seed variable, or call set.seed(n) to set the random seed to a known state. The only feasible way to reproduce a particular state is to start in a known state and repeat the calls that led to the one you want.
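As a rough sketch of the save-and-restore route: instead of trying to recover a seed after the fact, you can snapshot .Random.seed before each run and put it back later. The variable names here (saved_state, a, a_again) are only illustrative, and the example assumes R's default generators.

```r
# Make sure a seed exists; R only creates .Random.seed once the RNG is used.
set.seed(42)
x0 <- runif(3)                       # advance the generator a little

# Snapshot the full RNG state from the global environment.
saved_state <- get(".Random.seed", envir = globalenv())

a <- rnorm(5)                        # the draw we may want to replay later
b <- rnorm(5)                        # further draws move the state along

# Restore the snapshot and repeat the same call.
assign(".Random.seed", saved_state, envir = globalenv())
a_again <- rnorm(5)

identical(a, a_again)                # the replayed draw matches the first one
```

If you run a stochastic procedure n times, storing the snapshot taken just before each run lets you replay whichever run you later decide to keep.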
Upvotes: 2
Reputation: 545588
To answer your points in turn:
Would [reverse engineering the seed from the result of a computation] be possible?
It depends on the actual random-number generator being used, but in general this is hard, because the state space of a good RNG is huge and you might have to search it exhaustively. Potentially this wouldn’t just take hours but years.
a new seed is only used when I delete all the variables in the environment?
A new seed is used whenever you invoke set.seed. The actual current seed value is stored in the hidden variable .Random.seed in the global environment. However, removing the seed won’t make your last computation reproducible, since R re-initialises the value of that seed based on a non-deterministic value (in actual fact, the current time and the process ID).
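A small sketch of both behaviours, assuming a fresh session with R's default generator (the names first, second and third are just for illustration):

```r
# Same seed, same draws.
set.seed(123)
first <- runif(2)

set.seed(123)
second <- runif(2)
identical(first, second)

# Deleting the hidden state does not rewind anything: on the next RNG call
# R simply creates a fresh, non-reproducible seed.
rm(list = ".Random.seed", envir = globalenv())
third <- runif(2)
exists(".Random.seed", envir = globalenv())   # the state has been recreated
```

Here third will almost certainly differ from first, and it will differ again the next time the script is run.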
if I run the same code more than once but without cleaning the environment, the results are the same, hence the same seed was used.
No: consuming random values (by calling a stochastic function) changes the random seed. So running multiple computations in a row without cleaning the environment does not produce the same result. In fact, that would be terrible. You can see this easily yourself:
〉rnorm(1)
[1] -0.3156453
〉rnorm(1)
[1] 0.7345465
… clearly, two consecutive calls of a stochastic function (here, rnorm) did not produce the same result, even though I didn’t clean the environment in between calls.
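You can also watch the state itself move. The following is a minimal sketch, assuming the default generator; before and after are illustrative names:

```r
set.seed(1)
before <- get(".Random.seed", envir = globalenv())

x <- rnorm(1)                        # consuming a random value ...
after <- get(".Random.seed", envir = globalenv())

identical(before, after)             # ... leaves the RNG in a different state

y <- rnorm(1)                        # so the next draw is a different number
c(x, y)
```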
is there a way to understand when a function is dependent on set.seed()?
You could set different random seeds, rerun the function, and see if the output changes.
Apart from that there is no general, straightforward way to do this. If the function does not document its dependency on set.seed, then your only recourse is to look carefully at the source code of the function (and all functions it calls in turn).
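One empirical check, sketched here with a hypothetical helper called uses_rng (not part of R or any package), is to compare the RNG state before and after the call:

```r
# Hypothetical helper: TRUE if evaluating f() changes the RNG state.
uses_rng <- function(f) {
  set.seed(1)                                  # start from a known state
  before <- get(".Random.seed", envir = globalenv())
  f()
  after <- get(".Random.seed", envir = globalenv())
  !identical(before, after)
}

uses_rng(function() rnorm(10))     # draws random numbers
uses_rng(function() mean(1:10))    # purely deterministic
```

This is only a heuristic. A function that saves and restores .Random.seed internally, or that only draws random numbers for some inputs, can slip past it, so reading the source remains the reliable route.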
Bonus (as noted by Roland in the comments):
glm and predict.glm are not stochastic functions, and do not depend on set.seed.
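This is easy to check for yourself. The sketch below uses the built-in mtcars data as a stand-in for the question's df, and the wrapper fit_with_seed is only an illustrative name:

```r
# Fit the same logistic regression after setting different seeds.
fit_with_seed <- function(seed) {
  set.seed(seed)
  mod <- glm(am ~ hp + wt, data = mtcars, family = binomial)
  predict(mod)                     # linear predictor, as in the question
}

p1 <- fit_with_seed(1)
p2 <- fit_with_seed(2024)
identical(p1, p2)                  # the seed makes no difference to the fit
```

If results for a model like this vary between runs, the randomness is coming from somewhere else in the pipeline, for example a random train/test split or resampling step.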
Upvotes: 1