Dominique Makowski
Dominique Makowski

Reputation: 1673

Generate a perfectly normally distributed sample of size n in R

I would like to generate a sample of mean = 0, sd = 1 and size n = 100 which distribution is as normal as possible. Using rnorm alone returns a lot of variability.

The only way I found was to average multiple rnorms.

rowMeans(replicate(10000, sort(rnorm(100, 0, 1))))

This returns a rather satisfying result, but I'm not sure it's the most efficient way of doing it.


EDIT:

I don't want the mean and sd to be strictly equal to 0 and 1, but rather, the distributin to "look" like a normal distribution (when plotting the density curve).

It seems that the qnorm method works worse than the "average" method:

# qnorm method
x <- qnorm(seq(.00001, .99999, length.out = 100), mean=0, sd=1)
plot(density(x))

# average method
x <- rowMeans(replicate(10000, sort(rnorm(100, mean=0, sd=1))))
plot(density(x))

enter image description here enter image description here

I would be pleased with a deterministic solution returning results close to the average method in a more efficient way.


EDIT 2 : Possible solution

Based on the answers, the following seems to work, adjusting the bounds relatively to n:

x <- qnorm(seq(1/n, 1-1/n, length.out = n), mean=0, sd=1)

Below a comparison of the qnorm and average methods for different values of n:

par(mfrow=c(6,2))
for(n in c(10, 20, 100, 500, 1000, 9876)){
  x <- qnorm(seq(1/n, 1-1/n, length.out = n), mean=0, sd=1)
  plot(density(x), col="blue", lwd=2)

  x <- rowMeans(replicate(10000, sort(rnorm(n, mean=0, sd=1))))
  plot(density(x), col="red", lwd=2)
}

enter image description here

Upvotes: 1

Views: 1152

Answers (4)

gaspar
gaspar

Reputation: 1078

You can use the bayestestR package:

library(bayestestR)
x <-  rnorm_perfect(n = 100, mean = 0, sd = 1)
plot(density(x))

enter image description here

Upvotes: 1

Alb_Sim
Alb_Sim

Reputation: 49

Low discrepancy sequence? halton, faure, sobol, hammersley: example:

library(randtoolbox)

sequence <-sobol(n=100, dim = 1, init = TRUE, scrambling = 0, seed = 4711, normal = FALSE)
mean(sequence)
[1] 0.4982031
sd(sequence)
[1] 0.2860574

#trial with prng
set.seed(1) 
sequence2 <- runif(100)
mean(sequence2)
[1] 0.5178471
sd(sequence2)
[1] 0.2675848

with the same amount of points low discrepancy sequence are better than pseudoramdom generator, keep in mind that for uniform random sample the true mean is 0.5, sd is 0.2886751 (sqrt(1/12)), look at the numbers.

(mean(sequence) - 0.5)/0.5   #  -0.0008984375
(mean(sequence2) - 0.5)/0.5  #  -0.008923532
(sd(sequence) - sqrt(1/12))*sqrt(12)
[1] -0.009067992
(sd(sequence2) - sqrt(1/12))*sqrt(12)
[1] -0.07305918

~10 times better, try with other seed if you do not believe it

ks.test(sequence,"runif")

    One-sample Kolmogorov-Smirnov test

data:  sequence
D = 0.96268, p-value < 2.2e-16
alternative hypothesis: two-sided

> ks.test(sequence2,"runif")

    One-sample Kolmogorov-Smirnov test

data:  sequence2
D = 0.93956, p-value < 2.2e-16
alternative hypothesis: two-sided

Now some balancing:

    sequence <- c(sequence, 1.0 - sequence)  #balancing the mean = use antithetics
    #or if you want (sequence <- sequence - mean(sequence))
    normal_sample <- qnorm(sequence)
    normal_sample <- normal_sample/sd(normal_sample)
    plot(normal_sample)

Upvotes: 0

Melissa Key
Melissa Key

Reputation: 4551

If you want a deterministic solution, this should work

qnorm(seq(0.01, 0.99, length.out = 100))

Note that qnorm(0) gives $-\infty$ and qnorm(1) is $\infty$, so you need to find some reasonable bounds.

For n=100, the bounds 0.01 and 0.99 appear to work best. If you want the bounds to be farther out for the deterministic solution, you'd need to increase n.

Upvotes: 5

Spacedman
Spacedman

Reputation: 94277

Are you trying to create 100 numbers with an approximate normal distribution with mean exactly zero and sd exactly one? Do this:

Start roughly:

> X = rnorm(100)

Shift them:

> X = X-mean(X)

Scale them:

> X = X/sd(X)

Check it:

> mean(X)
[1] -7.223497e-18

near enough

> sd(X)
[1] 1

bang on.

This is the same as what the scale function does:

> X = rnorm(100)
> mean(X)
[1] -0.007667039
> sd(X)
[1] 0.9336842
> sx = scale(X)
> mean(sx)
[1] 1.437056e-17
> sd(sx)
[1] 1

Upvotes: 6

Related Questions