user1375871
user1375871

Reputation: 1249

Efficient random number generation from a truncated normal distribution

I would like to sample 50,000 values from the normal distribution with mean = 0 and sd -1. But I want to limit the values to [-3,3]. I've written code to do this, but not sure if it is most efficient? Was hoping to get some suggestions.

lower <- -3 
upper <- 3
x_norm<-rnorm(75000,0,1)
x_norm<-x_norm[which(x_norm >=lower & x_norm<=upper)]
repeat{
    x_norm<-c(x_norm, rnorm(10000,0,1))
    x_norm<-x_norm[which(x_norm >=lower & x_norm<=upper)]
    if(length(x_norm) >= 50000){break}
}
x_norm<-x_norm[1:50000]

Upvotes: 11

Views: 5654

Answers (3)

Greg Snow
Greg Snow

Reputation: 49640

John and Dirk gave nice examples of rejection sampling, which should be fine for the given question. But to give another approach, when you have the cumulative distribution function and its inverse (or reasonable approximations thereof) you can just generate data from a uniform distribution and transform:

x <- qnorm( runif(50000, pnorm(-3), pnorm(3)) )
range(x)
hist(x)

For the given question I don't expect this to be much better (if any better) than the rejection sampling methods, but if you wanted to generate data between 2 and 3 from a truncated normal 0,1 then this method would probably be much more efficient. It does depend on the cumulative and its inverse (pnorm and qnorm in this case) and so would not be as simple as the rejection sampling for a distribution without those easily available.

Upvotes: 12

Dirk is no longer here
Dirk is no longer here

Reputation: 368241

If you really care about efficiency this short piece of Rcpp code will be hard to beat. Store the following in a file, say /tmp/rnormClamp.cpp:

#include <Rcpp.h>

using namespace Rcpp;

// [[Rcpp::export]]
NumericVector rnormClamp(int N, int mi, int ma) {
    NumericVector X = rnorm(N, 0, 1);
    return clamp(mi, X, ma);
}

/*** R
  system.time(X <- rnormClamp(50000, -3, 3))
  summary(X)
*/

Use sourceCpp() (from Rcpp as well) to build and run it. The actual draw and clamping takes about 4 milliseconds on my computer:

R> sourceCpp("/tmp/rnormClamp.cpp")

R>   system.time(X <- rnormClamp(50000, -3, 3))
   user  system elapsed 
  0.004   0.000   0.004 

R>   summary(X)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-3.00000 -0.67300 -0.00528  0.00122  0.68500  3.00000 
R> 

The clamp() sugar function was featured in this previous SO answer by Romain, which also notes that you want version 0.10.2 of Rcpp.

Edit: Per Ben's hint, I seemed to have misunderstood. Here is a mix of C++ and R:

// [[Rcpp::export]]
List rnormSelect(int N, int mi, int ma) {
  RNGScope scope;
  int N2 = N * 1.25;
  NumericVector X = rnorm(N2, 0, 1);
  LogicalVector ind = (X < mi) | (X > ma);
  return List::create(X, ind);
}

which one can append to the earlier file. Then:

R>   system.time({ Z <- rnormSelect(50000, -3, 3); 
+                  X <- Z[[1]][ ! Z[[2]] ]; X <- X[1:50000]})
   user  system elapsed 
  0.008   0.000   0.009 

R>   summary(X)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-3.00000 -0.68200 -0.00066 -0.00276  0.66800  3.00000 
R> 

I'll revisit to the logical indexing and the row subset which I'll have to look up. Maybe tomorrow. But 9 milliseconds is still not too bad :)

Edit 2: Looks like we really don't have logical indexing. We'll have to add this. This version does it 'by hand' but is not that much faster than indexing from R:

// [[Rcpp::export]]
NumericVector rnormSelect2(int N, int mi, int ma) {
  RNGScope scope;
  int N2 = N * 1.25;
  NumericVector X = rnorm(N2, 0, 1);
  LogicalVector ind = (X >= mi) & (X <= ma);
  NumericVector Y(N);
  int k=0;
  for (int i=0; i<N2 & k<N; i++) {
    if (ind[i]) Y(k++) = X(i);
  }
  return Y;
}

And the output:

R>   system.time(X <- rnormSelect2(50000, -3, 3)) 
   user  system elapsed 
  0.004   0.000   0.007 

R>   summary(X)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-2.99000 -0.66900 -0.00258  0.00223  0.66700  2.99000 

R>   length(X)
[1] 50000
R> 

Upvotes: 11

John
John

Reputation: 23758

Something like your code will most definitely work but you're greatly overestimating how many values you need. Given that it's a known distribution and a fairly large number of samples you know how many will appear more or less than 3.

(1-pnorm(3))*2 * 50000
[1] 134.9898

So, given you likely only get about 135 out of range in a draw of 50,000 it's pretty easy to draw a few more than that, but still not an inordinately larger number and trim it. Just take the first 50,000 of 50,500 that are less than or greater than 3.

x <- rnorm(50500)
x <- x[x < 3 & x > -3]
x <- x[1:50000]

I ran the first 2 lines 40,000 times and it returned a length greater than 50000 every time. A small boolean check could guarantee it always does.

x <- 1
while (length(x) < 50000){
    x <- rnorm(50500)
    x <- x[x < 3 & x > -3]}
x <- x[1:50000]

For me this executes almost 100% of the time in 6 ms. It's a simple way to do it in R that executes very fast, is easy to read, and doesn't require add ons.

Upvotes: 15

Related Questions