Reputation: 290
I've been stuck with this for a while, so I decided to write a question.
Problem: How to generate a random sample (of length n) with a lower/upper bound and around a specific mean.
Observation: the distribution doesn't need to be a specific one (it could be normal, beta, etc.).
Approaches considered:
The rtnorm function (package msm), which generates random numbers from a normal distribution within specified bounds, but it doesn't hold the wanted mean value.
A second approach I've tried is this function, which I found in a question I can't find anymore:
rBootstrap <- function(n, mean, sd, lowerBound, upperBound){
  range <- upperBound - lowerBound
  m <- (mean - lowerBound) / range  # mapping mean to 0-1 range
  s <- sd / range                   # mapping sd to 0-1 range
  a <- (m^2 - m^3 - m*s^2)/s^2             # calculating alpha for rbeta
  b <- (m - 2*m^2 + m^3 - s^2 + m*s^2)/s^2 # calculating beta for rbeta
  data <- rbeta(n, a, b)            # generating data
  data <- lowerBound + data * range # remapping to given bounds
  return(data)
}
This function actually gives great results unless upperBound > lowerBound + 2 * (mean - lowerBound) (that is, the distance from the lowerBound to the upperBound exceeds two times the distance from the lowerBound to the mean).
In particular, I would like to generate a random sample of length 1,800, with values between 50,000 and 250,000 and a mean value of 70,000.
Upvotes: 2
Views: 1925
Reputation: 20080
You should use a truncated normal distribution, but the mean should be recalibrated. If you look at the mean argument in the rtnorm documentation, it is clearly stated: mean is the mean of the original Normal distribution before truncation.
If you want the OBSERVABLE mean to be equal to the desired value, just use the formula from the Truncated Normal wiki article:
mu = E + sigma*(f(b) - f(a))/(F(b) - F(a))
Here E is the mean value you want to have (70,000 in your case), f(x) is the Gaussian density, F(x) is the cumulative distribution function, and a and b are the interval boundaries (centered and scaled):
a = (LB - mu)/sigma
b = (RB - mu)/sigma
After you have computed mu, pass it to rtnorm as the mean parameter.
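For reference, the formula above is a rearrangement of the standard expression for the mean of a normal distribution truncated to [LB, RB]. Written in the same notation (E, mu, sigma, f, F, a, b as defined above), it reads:

```latex
E = \mu + \sigma \, \frac{f(a) - f(b)}{F(b) - F(a)},
\qquad
a = \frac{LB - \mu}{\sigma},
\quad
b = \frac{RB - \mu}{\sigma}
```

Because a and b themselves depend on mu, this is an implicit equation in mu rather than a closed-form expression.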
NB: you might want to do a similar exercise with sigma: what goes into rtnorm is NOT what you are going to observe in sampling; see the wiki reference again.
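To see how far the observed spread can drift from the sigma you pass in, here is a minimal sketch that evaluates the standard truncated normal mean and variance formulas. The function name truncMoments is hypothetical (it is not part of msm), and the example values of mu and sigma are only an illustration using the bounds from the question.

```r
# Hypothetical helper (not part of msm): observed mean and sd of a
# Normal(mu, sigma) truncated to [a, b], from the standard formulas.
truncMoments <- function(mu, sigma, a, b) {
  alfa <- (a - mu) / sigma              # scaled lower bound
  beta <- (b - mu) / sigma              # scaled upper bound
  Z <- pnorm(beta) - pnorm(alfa)        # probability mass kept by truncation
  shift <- (dnorm(alfa) - dnorm(beta)) / Z
  m <- mu + sigma * shift               # observed mean
  v <- sigma^2 * (1 + (alfa * dnorm(alfa) - beta * dnorm(beta)) / Z - shift^2)
  c(mean = m, sd = sqrt(v))             # observed mean and sd
}

# Example: bounds from the question, with an assumed mu and sigma
res <- truncMoments(mu = 52275.475, sigma = 24000, a = 50000, b = 250000)
print(res)
```

With a lower bound this close to mu, the observed sd comes out well below the sigma that was passed in, which is why sigma may need its own recalibration if a particular spread matters.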
UPDATE
OK, I got to the code myself, though the first cut is done in Python (looking into R right now). The problem is that, for a given observable mean, mu appears in f(a), f(b), F(a) and F(b), which turns the problem into a root search for a non-linear equation. But it is solvable; please check the code. Note that it follows the wiki notation pretty closely.
For example for your parameters and sigma=12,000, I got
Found mu = 68430.372119287 for the desired mean 70000.0 and sigma 12000.0
Sampled 100000 truncated gaussians and got observed mean = 70023.15990337673
For your parameters and sigma=24,000, I got
Found mu = 52275.475000378945 for the desired mean 70000.0 and sigma 24000.0
Sampled 100000 truncated gaussians and got observed mean = 69922.16000288539
So mu gets pretty close to the left boundary for large sigma, which is expected behavior, but the observed mean stays close to 70,000, which is what you want.
UPDATE II
Here is the R code; it is in the GitHub repo as well:
library(msm)  # provides rtnorm; uniroot comes with base R (stats), so rootSolve is not needed

# standard normal density
phi <- function(z) {
  dnorm(z)
}

# standard normal cumulative distribution function
Phi <- function(z) {
  pnorm(z)
}

# observed mean of a Normal(mu, sigma) truncated to [a, b]
Mean <- function(mu, sigma, a, b) {
  alfa <- (a - mu) / sigma
  beta <- (b - mu) / sigma
  Z <- Phi(beta) - Phi(alfa)
  mu + sigma*(phi(alfa) - phi(beta))/Z
}

# zero when the truncated mean equals the desired mean
f <- function(mu, mean, sigma, a, b) {
  mean - Mean(mu, sigma, a, b)
}

a <- 50000.0
b <- 250000.0
mean <- 70000.0
sigma <- 24000.0

# find mu for desired mean; the extra arguments are passed on to f
q <- uniroot(f, c(a, b), mean, sigma, a, b)
mu <- q$root
print(sprintf("Found mu = %f for the desired mean %f and sigma %f", mu, mean, sigma))

# sampling test
set.seed(32345)
N <- 100000
r <- rtnorm(N, mean=mu, sd=sigma, lower=a, upper=b)
print(sprintf("Sampled %d truncated gaussians and got observed mean = %f", N, mean(r)))
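For the sample size in the question (1,800 values), the same idea can be wrapped in one function. The sketch below is an illustration, not the code from the repo: the name rBoundedMean and its argument names are hypothetical, sigma = 12000 is an assumed choice, and it swaps rtnorm for plain inverse-CDF sampling so that it runs in base R without msm.

```r
# Hypothetical helper: draw n values in [lower, upper] whose expected
# mean is `target`. sigma is the sd of the untruncated normal.
rBoundedMean <- function(n, target, sigma, lower, upper) {
  # observed mean of Normal(mu, sigma) truncated to [lower, upper]
  truncMean <- function(mu) {
    alfa <- (lower - mu) / sigma
    beta <- (upper - mu) / sigma
    mu + sigma * (dnorm(alfa) - dnorm(beta)) / (pnorm(beta) - pnorm(alfa))
  }
  # Solve truncMean(mu) = target; assumes the root lies inside
  # [lower, upper], which can fail when sigma is large.
  mu <- uniroot(function(mu) truncMean(mu) - target, c(lower, upper))$root
  # Inverse-CDF sampling: uniforms restricted to the CDF range of the bounds
  pLow <- pnorm(lower, mu, sigma)
  pHigh <- pnorm(upper, mu, sigma)
  qnorm(runif(n, pLow, pHigh), mu, sigma)
}

set.seed(1)
x <- rBoundedMean(1800, target = 70000, sigma = 12000,
                  lower = 50000, upper = 250000)
print(range(x))  # all values stay within the bounds
print(mean(x))   # close to, but not exactly, the target
```

With only 1,800 draws the sample mean fluctuates around the target rather than hitting it exactly; if an exact sample mean is required, an additional adjustment step would be needed.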
Upvotes: 2