Michelle
Michelle

Reputation: 35

Generating a normal distribution from known percentiles

I am trying to understand how to generate a normal distribution if I already know specific percentiles.

A user gave a very comprehensive answer to a similar question (link here), but when I tried and tested it with my existing data, the variance was much too large.

How I did this:

x <- c(5,8,11)
PercRank <- c(2.1, 51.1, 98.8)

PercRank = 2.1 for example tells that 2.1% of the data has a value/score <= 5 (the first value of x). Similarly, PercRank = 51.1 tells that 51.1% of the data has a value/score <= 8.

I followed the method in this link. This is my code:

cum.p <- c(2.1, 51.1, 98.8)/100
prob <- c( cum.p[1], diff(cum.p), .01)
x <- c(5,8,11)

freq <- 1000 # final output size that we want

# Extreme values beyond x (to sample)
init <- -(abs(min(x)) + 1) 
fin  <- abs(max(x)) + 1

ival <- c(init, x, fin) # generate the sequence to take pairs from
len <- 100 # sequence of each pair

s <- sapply(2:length(ival), function(i) {
  seq(ival[i-1], ival[i], length.out=len)
})
# sample from s, total of 10000 values with probabilities calculated above
out <- sample(s, freq, prob=rep(prob, each=len), replace = T)

quantile(out, cum.p) 
# 2% 51.1% 98.8% 
# 5     8    11 

c(mean(out), sd(out))
# [1] 7.834401 2.214227

All of this is from the comment (linked), and so far so good. Then I tried to check how well the generated normal distribution worked with my fitted values:

data.frame(sort(rnorm(1000, mean=mean(out), sd=sd(out))))
...
# 988                                          13.000904
# 989                                          13.028881
# 990                                          13.076649
...
# 1000                                         14.567080

I was concerned because the 988th value (e.g., 98.8% of 1000 samples) was 13.000904, while the value I had fitted for the 98.8% percentile was 11.0.

I re-generated the distribution many times and the variance was consistently larger than it needed to be.

I'm stumped. I would appreciate if anyone could show me a way to make the variance more accurate. Or, is this unavoidable?

(My first time posting here, I apologize if I broke rules - I can make it more clear if needed.)

Upvotes: 3

Views: 2288

Answers (1)

Roland
Roland

Reputation: 132706

Why don't you treat this as an optimization problem?

x <- c(5,8,11)
PercRank <- c(2.1, 51.1, 98.8)

fun <- function(par, pq) {
  sum((log(pq[,1]/100)-pnorm(pq[,2], mean=par[1], sd=par[2], log.p=TRUE))^2)
}

par.estimates <- optim(c(0,1), fn=fun, pq=cbind(PercRank, x))

pnorm(11, par.estimates[[1]][1], par.estimates[[1]][2])
#[1] 0.9816948

The result seems resonable, but there is some difference to the expected value for q=11. However, I suspect that's a problem of your data (e.g., due to rounding), since the following works well:

PercRank <- pnorm(x, 8, 2)*100
par.estimates <- optim(c(0,1), fn=fun, pq=cbind(PercRank, x))
par.estimates[[1]]
#[1] 7.999774 1.999953

Of course, there might be better optimizers for this specific problem.

Upvotes: 1

Related Questions