Reputation: 607
If you run code like:
length(unique(runif(10000000)))
length(unique(rnorm(10000000)))
you'll see that only about 99.8% of the runif values are unique, but 100% of the rnorm values are. I thought this might be because of the constrained range, but widening the range to (0, 100000) for runif doesn't change the result. Continuous distributions should have zero probability of repeats, and I know that isn't the case in floating-point arithmetic, but I'm curious why the two functions don't produce anywhere near the same number of repeats.
Upvotes: 13
Views: 641
Reputation: 66844
This is due primarily to the properties of the default PRNG (the fact that runif has a smaller range than rnorm, and therefore a smaller number of representable values, may also have a similar effect at some point even if the RNG doesn't). It is discussed somewhat obliquely in ?Random:
Do not rely on randomness of low-order bits from RNGs. Most of the supplied uniform generators return 32-bit integer values that are converted to doubles, so they take at most 2^32 distinct values and long runs will return duplicated values (Wichmann-Hill is the exception, and all give at least 30 varying bits.)
With the example:
sum(duplicated(runif(1e6))) # around 110 for default generator
## and we would expect about almost sure duplicates beyond about
qbirthday(1 - 1e-6, classes = 2e9) # 235,000
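As a rough cross-check of those figures, the expected number of duplicates can be estimated directly. This is only a sketch: it assumes the generator returns 2^32 equally likely values (the upper bound from the quoted documentation), and expected_duplicates is a hypothetical helper name, not part of R.

```r
# Expected number of duplicated draws when sampling n values uniformly
# (with replacement) from N equally likely values.
# Assumption: N = 2^32 is the upper bound quoted from ?Random, not a
# measured property of any particular generator.
expected_duplicates <- function(n, N = 2^32) {
  # Expected number of distinct values is N * (1 - (1 - 1/N)^n);
  # expm1/log1p keep the arithmetic stable for tiny 1/N.
  expected_distinct <- -N * expm1(n * log1p(-1 / N))
  n - expected_distinct
}

expected_duplicates(1e6)        # same order of magnitude as the "around 110" above
expected_duplicates(1e7)        # the sample size used in the question
expected_duplicates(1e7) / 1e7  # expected fraction of repeated values
```

Under that assumption the estimate for 1e7 draws comes out at roughly a tenth of a percent of the sample, which is the same order of magnitude as the shortfall reported in the question.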
Changing to the Wichmann-Hill generator indeed reduces the chance of duplicates:
RNGkind("Wich")
sum(duplicated(runif(1e6)))
[1] 0
sum(duplicated(runif(1e8)))
[1] 0
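Note that RNGkind changes the generator for the rest of the session. If you only want to try Wichmann-Hill for one experiment, something like the following sketch restores the previous settings afterwards (the variable names and the seed are arbitrary choices for illustration):

```r
# RNGkind() invisibly returns the kinds that were in use before the call,
# so they can be saved and restored later.
old_kind <- RNGkind("Wichmann-Hill")

set.seed(1)  # arbitrary seed, only for reproducibility of this sketch
n_dup <- sum(duplicated(runif(1e6)))
n_dup

# Restore the generator settings that were active before the experiment.
do.call(RNGkind, as.list(old_kind))
RNGkind()
```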
Upvotes: 4
Reputation: 52008
The documentation for random number generation says:
Do not rely on randomness of low-order bits from RNGs. Most of the supplied uniform generators return 32-bit integer values that are converted to doubles, so they take at most 2^32 distinct values and long runs will return duplicated values (Wichmann-Hill is the exception, and all give at least 30 varying bits.)
By the birthday paradox you would expect to see repeated values in a set of more than roughly 2^16 values, and 10000000 > 2^16. I haven't found anything directly in the documentation about how many distinct values rnorm will return, but it is presumably larger than 2^32. It is interesting to note that set.seed has separate parameters: kind, which determines the uniform generator, and normal.kind, which determines the normal generator, so the latter is not a simple transformation of the former.
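The 2^16 rule of thumb can be checked with qbirthday from the stats package. This is a sketch that assumes 2^32 equally likely values, which is the documented upper bound rather than a measured figure; n_classes and n_half are names chosen here for illustration.

```r
# Assumption: 2^32 equally likely values, the bound quoted from ?Random.
n_classes <- 2^32

# Smallest sample size with at least a 50% chance of one repeated value.
n_half <- qbirthday(0.5, classes = n_classes)
n_half
n_half / 2^16  # ratio to the 2^16 rule of thumb

# The sample size in the question is far beyond that threshold.
1e7 > n_half
```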
Upvotes: 3