Reputation: 35
I am trying to understand how to generate a normal distribution if I already know specific percentiles.
A user gave a very comprehensive answer to a similar question (link here), but when I tried and tested it with my existing data, the variance was much too large.
How I did this:
x <- c(5,8,11)
PercRank <- c(2.1, 51.1, 98.8)
PercRank = 2.1 for example tells that 2.1% of the data has a value/score <= 5 (the first value of x). Similarly, PercRank = 51.1 tells that 51.1% of the data has a value/score <= 8.
I followed the method in this link. This is my code:
cum.p <- c(2.1, 51.1, 98.8)/100
prob <- c( cum.p[1], diff(cum.p), .01)
x <- c(5,8,11)
freq <- 1000 # final output size that we want
# Extreme values beyond x (to sample)
init <- -(abs(min(x)) + 1)
fin <- abs(max(x)) + 1
ival <- c(init, x, fin) # generate the sequence to take pairs from
len <- 100 # sequence of each pair
s <- sapply(2:length(ival), function(i) {
seq(ival[i-1], ival[i], length.out=len)
})
# sample from s, total of 10000 values with probabilities calculated above
out <- sample(s, freq, prob=rep(prob, each=len), replace = T)
quantile(out, cum.p)
# 2% 51.1% 98.8%
# 5 8 11
c(mean(out), sd(out))
# [1] 7.834401 2.214227
All of this is from the comment (linked), and so far so good. Then I tried to check how well the generated normal distribution worked with my fitted values:
data.frame(sort(rnorm(1000, mean=mean(out), sd=sd(out))))
...
# 988 13.000904
# 989 13.028881
# 990 13.076649
...
# 1000 14.567080
I was concerned because the 988th value (e.g., 98.8% of 1000 samples) was 13.000904, while the value I had fitted for the 98.8% percentile was 11.0.
I re-generated the distribution many times and the variance was consistently larger than it needed to be.
I'm stumped. I would appreciate if anyone could show me a way to make the variance more accurate. Or, is this unavoidable?
(My first time posting here, I apologize if I broke rules - I can make it more clear if needed.)
Upvotes: 3
Views: 2288
Reputation: 132706
Why don't you treat this as an optimization problem?
x <- c(5,8,11)
PercRank <- c(2.1, 51.1, 98.8)
fun <- function(par, pq) {
sum((log(pq[,1]/100)-pnorm(pq[,2], mean=par[1], sd=par[2], log.p=TRUE))^2)
}
par.estimates <- optim(c(0,1), fn=fun, pq=cbind(PercRank, x))
pnorm(11, par.estimates[[1]][1], par.estimates[[1]][2])
#[1] 0.9816948
The result seems resonable, but there is some difference to the expected value for q=11. However, I suspect that's a problem of your data (e.g., due to rounding), since the following works well:
PercRank <- pnorm(x, 8, 2)*100
par.estimates <- optim(c(0,1), fn=fun, pq=cbind(PercRank, x))
par.estimates[[1]]
#[1] 7.999774 1.999953
Of course, there might be better optimizers for this specific problem.
Upvotes: 1