jntrcs

Reputation: 607

Why does runif() have fewer unique values than rnorm()?

If you run code like:

length(unique(runif(10000000)))
length(unique(rnorm(10000000)))

you'll see that only about 99.8% of the runif values are unique, while 100% of the rnorm values are. I thought this might be because of the constrained range, but widening the range to (0, 100000) for runif doesn't change the result. A continuous distribution should produce repeats with probability 0. I know that doesn't hold exactly in floating-point arithmetic, but I'm curious why the two functions don't show roughly the same number of repeats.

Upvotes: 13

Views: 641

Answers (2)

James

Reputation: 66844

This is due primarily to the properties of the default PRNG. (The fact that runif has a smaller range than rnorm, and therefore fewer representable values, may have a similar effect at some point even where the RNG itself doesn't.) It is discussed somewhat obliquely in ?Random:

Do not rely on randomness of low-order bits from RNGs. Most of the supplied uniform generators return 32-bit integer values that are converted to doubles, so they take at most 2^32 distinct values and long runs will return duplicated values (Wichmann-Hill is the exception, and all give at least 30 varying bits.)

With the example:

sum(duplicated(runif(1e6))) # around 110 for default generator
## and we would expect about almost sure duplicates beyond about
qbirthday(1 - 1e-6, classes = 2e9) # 235,000
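As a rough cross-check of the "around 110" figure, here is a minimal sketch. It assumes the default generator behaves like independent draws, with replacement, from 2^32 equally likely values, which is an idealization of the quoted description rather than something ?Random states in these terms. The helper name expected_duplicates is made up for this illustration and is not part of R.

```r
# Expected number of duplicated entries when drawing n values uniformly,
# with replacement, from `classes` equally likely values (assumed model).
expected_duplicates <- function(n, classes) {
  # Expected number of distinct values: classes * (1 - (1 - 1/classes)^n),
  # written with log1p/expm1 to stay accurate when 1/classes is tiny.
  expected_distinct <- classes * (-expm1(n * log1p(-1 / classes)))
  n - expected_distinct
}

dups_1e6 <- expected_duplicates(1e6, 2^32)
dups_1e7 <- expected_duplicates(1e7, 2^32)

print(dups_1e6)
print(dups_1e7 / 1e7)  # expected fraction of duplicated draws at n = 1e7
```

Under that assumption the expected count is close to n^2 / (2 * 2^32), which is roughly 116 for n = 1e6 and about 0.12% of the draws for n = 1e7, in line with the roughly 99.8% unique values reported in the question.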

Changing to the Wichmann-Hill generator indeed reduces the chance of duplicates:

RNGkind("Wichmann-Hill")
sum(duplicated(runif(1e6)))
[1] 0
sum(duplicated(runif(1e8)))
[1] 0
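If you want to compare the two generators side by side without leaving the session switched to Wichmann-Hill, something like the following sketch may help. The helper count_dups and the seed value are hypothetical choices for illustration. The sketch relies on RNGkind() returning the previously active kinds so that the uniform generator can be restored afterwards.

```r
# Count duplicated runif() values under a given uniform generator,
# then restore whichever uniform generator was active before the call.
count_dups <- function(kind, n = 1e6, seed = 123) {
  old <- RNGkind(kind)                  # returns the previous kinds
  on.exit(RNGkind(old[1]), add = TRUE)  # restore the uniform generator
  set.seed(seed)                        # arbitrary seed, for repeatability
  sum(duplicated(runif(n)))
}

dups_mt <- count_dups("Mersenne-Twister")
dups_wh <- count_dups("Wichmann-Hill")

cat("Mersenne-Twister duplicates:", dups_mt, "\n")
cat("Wichmann-Hill duplicates:   ", dups_wh, "\n")
```

Note that switching generators re-initializes the random seed, so call set.seed() again afterwards if the rest of your script needs reproducible draws.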

Upvotes: 4

John Coleman

Reputation: 52008

The documentation for random number generation (?Random) says:

Do not rely on randomness of low-order bits from RNGs. Most of the supplied uniform generators return 32-bit integer values that are converted to doubles, so they take at most 2^32 distinct values and long runs will return duplicated values (Wichmann-Hill is the exception, and all give at least 30 varying bits.)
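One way to probe the "32-bit integer values that are converted to doubles" statement empirically is to check whether every runif() draw sits on a grid with spacing 2^-32. This is only a sketch: it assumes the default Mersenne-Twister generator scales a 32-bit integer by 2^-32, which is my reading of the quoted passage rather than something it spells out. The seed and the variable names are arbitrary.

```r
old_kind <- RNGkind("Mersenne-Twister")  # R's default uniform generator
set.seed(42)                             # arbitrary seed, for repeatability

u <- runif(1e5)

# Multiplying a double by a power of two is exact, so if each draw is
# k * 2^-32 for some integer k, then u * 2^32 has no fractional part.
scaled  <- u * 2^32
on_grid <- all(scaled == floor(scaled))
print(on_grid)

RNGkind(old_kind[1])                     # restore the previous generator
```

If on_grid comes out TRUE, the draws can take at most about 2^32 distinct values, no matter how many digits each double appears to have when printed.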

By the birthday paradox, you would expect to see repeated values in a set of more than roughly 2^16 values, and 10000000 > 2^16. I haven't found anything directly in the documentation about how many distinct values rnorm can return, but it is presumably more than 2^32. It is interesting to note that set.seed has separate parameters: kind, which determines the uniform generator, and normal.kind, which determines the normal generator. This suggests that the latter is not a simple transformation of a single draw from the former.
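A quick empirical check of that guess is to count duplicates from both functions at the same sample size. The sketch below pins the generators to what I believe are R's defaults ("Mersenne-Twister" and "Inversion"), so treat those names as an assumption about your setup. The seed is arbitrary.

```r
# Pin the uniform and normal generators (assumed defaults), remembering
# the previous settings so they can be restored at the end.
old <- RNGkind("Mersenne-Twister", "Inversion")
set.seed(1)

n <- 1e6
dup_unif <- sum(duplicated(runif(n)))
dup_norm <- sum(duplicated(rnorm(n)))

print(RNGkind())  # shows the uniform and normal kinds as separate settings
cat("runif duplicates:", dup_unif, "\n")
cat("rnorm duplicates:", dup_norm, "\n")

RNGkind(old[1], old[2])  # restore the previous uniform and normal kinds
```

A clearly smaller duplicate count for rnorm than for runif at the same n would be consistent with rnorm drawing from a larger set of distinct values, although it does not tell you how large that set is.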

Upvotes: 3
