Vitomir

Reputation: 295

Understanding set.seed() in R

I am using R and I have some issues with replicating the output of the following:

mod1 <- glm(TVAR ~ ., data = df, family = "binomial")
y <- predict.glm(mod1)

due to its dependency on set.seed().

I have some questions related to this.

Upvotes: 0

Views: 6634

Answers (2)

user2554330

Reputation: 44867

@KonradRudolph gives a good answer here. I'd just like to add one point to it:

There are three ways to set the random seed, and they are not the same:

  • Using set.seed(n) sets it to an easily reproducible state.
  • Calling any of the internal random number generators also changes it, in a deterministic but less predictable way.
  • Saving it and restoring it later sets it to the earlier state.

In general, set.seed() can only produce a tiny fraction of the possible values of the random seed, whereas calling the RNG should (eventually) cycle through all of them. There are about 2^20000 different random seeds possible, but set.seed() can only create about 2^32 of them. (Both of these numbers are over-estimates, but the ratio is about right.)

You can save and restore the .Random.seed variable, or call set.seed(n) to set the random seed to a known state. If you did not save a particular state, the only feasible way to reproduce it is to start from a known state and repeat the calls that led to the one you want.
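As a rough illustration of the three routes listed above, here is a minimal sketch. It assumes R's default generator settings and is run at top level, so that .Random.seed lives in the global environment; the variable names are only for illustration.

```r
# 1. set.seed(n) puts the generator into an easily reproducible state.
set.seed(42)
a <- runif(3)
set.seed(42)
b <- runif(3)
identical(a, b)                        # same seed, same draws

# 2. Drawing random numbers also changes the state, deterministically.
set.seed(42)
state_before <- .Random.seed
invisible(runif(1))                    # consume one random value
state_after <- .Random.seed
identical(state_before, state_after)   # the state has moved on

# 3. Saving the state and restoring it later returns to the earlier state.
saved <- .Random.seed
x <- rnorm(5)
assign(".Random.seed", saved, envir = globalenv())
y <- rnorm(5)
identical(x, y)                        # draws repeat after restoring
```

Restoring through assign() into globalenv() is used here because that is where R looks for .Random.seed; assigning to a local variable of the same name inside a function would have no effect on the generator.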

Upvotes: 2

Konrad Rudolph

Reputation: 545588

To answer your points in turn:

Would [reverse engineering the seed from the result of a computation] be possible?

It depends on the actual random-number generator being used, but in general this is hard, because the state space of a good RNG is huge and you might have to search it exhaustively. Potentially this wouldn’t just take hours but years.
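To make the exhaustive search concrete, here is a toy sketch. It only works because the seed is assumed to be a small integer passed to set.seed() and because the exact sequence of RNG calls is known; find_seed and the candidate range are hypothetical names and values chosen for the example. Over the full state space the same loop would be hopeless.

```r
# Pretend this output came from an unknown seed (417 is made up for the demo).
set.seed(417)
target <- rnorm(3)

# Try each candidate seed and replay the same RNG calls.
find_seed <- function(target, candidates) {
  for (s in candidates) {
    set.seed(s)
    if (identical(rnorm(length(target)), target)) {
      return(s)
    }
  }
  NA_integer_   # nothing in the candidate range matched
}

found <- find_seed(target, 1:1000)

# Check that the recovered seed really reproduces the output.
set.seed(found)
reproduced <- rnorm(3)
identical(reproduced, target)
```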

a new seed is only used when I delete all the variables in the environment?

A new seed is used whenever you invoke set.seed. The current seed is stored in the hidden variable .Random.seed in the global environment. However, removing that variable won’t make your last computation reproducible: R re-initialises the seed from a non-deterministic value (in practice, the current time and the process ID) the next time one is needed.
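You can watch this happen in a fresh session. The sketch below assumes it is run at top level; the variable names are only illustrative.

```r
set.seed(123)
had_seed <- exists(".Random.seed", envir = globalenv(), inherits = FALSE)

# Remove the seed, as clearing the environment would.
rm(list = ".Random.seed", envir = globalenv())
seed_after_rm <- exists(".Random.seed", envir = globalenv(), inherits = FALSE)

# The next stochastic call makes R create a fresh, non-reproducible seed.
invisible(runif(1))
seed_after_draw <- exists(".Random.seed", envir = globalenv(), inherits = FALSE)

c(had_seed, seed_after_rm, seed_after_draw)
```

Note that rm(list = ls()) does not remove .Random.seed by default, because ls() skips names starting with a dot unless all.names = TRUE is set.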

if I run the same code more than once but without cleaning the environment, the results are the same, hence the same seed was used.

No: consuming random values (by calling a stochastic function) changes the random seed. So running multiple computations in a row without cleaning the environment does not produce the same result. In fact, that would be terrible. You can see this easily yourself:

> rnorm(1)
[1] -0.3156453
> rnorm(1)
[1] 0.7345465

… clearly, two consecutive calls of a stochastic function (here, rnorm) did not produce the same result, even though I didn’t clean the environment in between calls.
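If you do want the same code to give the same result on every run, reset the seed before each run. A minimal sketch, where run_analysis is a hypothetical stand-in for whatever stochastic code you are re-running:

```r
# Hypothetical stand-in for a stochastic computation.
run_analysis <- function() mean(rnorm(100))

# Two runs in a row without re-seeding: the second continues from
# wherever the first left the generator.
set.seed(2024)
run1 <- run_analysis()
run2 <- run_analysis()
run1 == run2                 # the two runs differ

# Re-seeding before the run brings back the first result.
set.seed(2024)
run1_again <- run_analysis()
identical(run1, run1_again)  # same seed, same result
```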

is there a way to understand when a function is dependent on set.seed()?

You could set different random seeds, rerun the function, and see if the output changes.

Apart from that there is no general, straightforward way to do this. If the function does not document its dependency on set.seed, then your only recourse is to look carefully at the source code of the function (and all functions it calls in turn).
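The seed-varying check could be sketched roughly like this. varies_with_seed is a hypothetical helper, not something R provides, and it overwrites the current seed as a side effect.

```r
# Hypothetical helper: run f() under several seeds and report whether
# any result differs from the first one.
varies_with_seed <- function(f, seeds = 1:5) {
  results <- lapply(seeds, function(s) {
    set.seed(s)
    f()
  })
  same_as_first <- vapply(
    results[-1],
    function(r) identical(r, results[[1]]),
    logical(1)
  )
  !all(same_as_first)
}

stochastic    <- varies_with_seed(function() mean(rnorm(10)))
deterministic <- varies_with_seed(function() mean(1:10))
c(stochastic = stochastic, deterministic = deterministic)
```

This is only a heuristic. A function that calls set.seed internally, or that uses random numbers only on some code paths, can look deterministic under this test even though it is not.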


Bonus (as noted by Roland in the comments):

glm and predict.glm are not stochastic functions, and do not depend on set.seed.
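You can check this for the model in the question. The data below are simulated purely for the demonstration (the columns x1 and x2 and the coefficients are assumptions, and the question's real df is not available); only the data generation uses the RNG.

```r
# Simulated stand-in for the question's data frame (assumed structure).
set.seed(1)
df <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
df$TVAR <- rbinom(200, size = 1, prob = plogis(0.5 * df$x1 - 0.25 * df$x2))

# Fit and predict under a given seed, using the same calls as the question.
fit_and_predict <- function(seed) {
  set.seed(seed)
  mod1 <- glm(TVAR ~ ., data = df, family = "binomial")
  predict.glm(mod1)
}

y_seed1   <- fit_and_predict(1)
y_seed999 <- fit_and_predict(999)
identical(y_seed1, y_seed999)   # the seed makes no difference
```

If your own results differ between runs, the likely cause is upstream of glm, for example a random train/test split or resampling step that builds df.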

Upvotes: 1
