Sam Mason
Sam Mason

Reputation: 16184

possible bug in `rbinom()` for large numbers of trials

I've been writing some code that iteratively performs binomial draws (using rbinom) and for some callee arguments I can end up with the size being large, which causes R (3.1.1, both official or homebrew builds tested—so unlikely to be compiler related) to return an unexpected NA. For example:

rbinom(1,2^32,0.95)

is what I'd expect to work, but gives NA back. However, running with size=2^31 or prob≤0.5 works.

The fine manual mentions inversion being used when size < .Machine$integer.max is false, could this be the issue?

Upvotes: 5

Views: 1373

Answers (4)

Ben Bolker
Ben Bolker

Reputation: 226332

It appears that, despite Martyn Plummer saying in his answer (from 2014) that returning an integer value (and NA-with-a-warning when the value overflowed) was the intended behaviour, this is no longer true in at least the development version of R (4.5.0/SVN r86945).

set.seed(101); str(rbinom(1,1000,0.95))
##  int 942
set.seed(101); str(rbinom(1,2^50,0.95))
##  num 1.07e+15

Upvotes: 1

Ben Bolker
Ben Bolker

Reputation: 226332

I agree that (in the absence of documentation saying this is a problem) that this is a bug. A reasonable workaround would be using the Normal approximation, which should be very very good indeed (and faster) for such large values. (I originally meant this to be short and simple but it ended up getting a little bit out of hand.)

rbinom_safe <- function(n,size,prob,max.size=2^31) {
    maxlen <- max(length(size),length(prob),n)
    prob <- rep(prob,length.out=maxlen)
    size <- rep(size,length.out=maxlen)
    res <- numeric(n)
    bigvals <- size>max.size
    if (nbig <- sum(bigvals>0)) {
        m <- (size*prob)[bigvals]
        sd <- sqrt(size*prob*(1-prob))[bigvals]
        res[bigvals] <- round(rnorm(nbig,mean=m,sd=sd))
    }
    if (nbig<n) {
        res[!bigvals] <- rbinom(n-nbig,size[!bigvals],prob[!bigvals])
    }
    return(res)
}

set.seed(101)
size <- c(1,5,10,2^31,2^32)
rbinom_safe(5,size,prob=0.95)
rbinom_safe(5,3,prob=0.95)
rbinom_safe(5,2^32,prob=0.95)

The Normal approximation should work reasonably well whenever the mean is many standard deviations away from 0 or 1 (whichever is closer). For large N this should be OK unless p is very extreme. For example:

n <- 2^31
p <- 0.95
m <- n*p
sd <- sqrt(n*p*(1-p))
set.seed(101)
rr <- rbinom_safe(10000,n,prob=p)
hist(rr,freq=FALSE,col="gray",breaks=50)
curve(dnorm(x,mean=m,sd=sd),col=2,add=TRUE)
dd <- round(seq(m-5*sd,m+5*sd,length.out=101))
midpts <- (dd[-1]+dd[-length(dd)])/2
lines(midpts,c(diff(sapply(dd,pbinom,size=n,prob=p))/diff(dd)[1]),
      col="blue",lty=2)

enter image description here

Upvotes: 3

Martyn Plummer
Martyn Plummer

Reputation: 341

This is the intended behaviour, but there are two issues: 1) The NA induced by coercion should raise a warning 2) The fact that discrete random variables have storage mode integer should be documented.

I have fixed 1) and will modify the documentation to fix 2) when I have a little more time.

Upvotes: 3

Roland
Roland

Reputation: 132706

Looking at the source rbinom does the equivalent (in C code) of the following for such large sizes:

qbinom(runif(n), size, prob, FALSE)

And indeed:

set.seed(42)
rbinom(1,2^31,0.95)
#[1] 2040095619
set.seed(42)
qbinom(runif(1), 2^31, 0.95, F)
#[1] 2040095619

However:

set.seed(42)
rbinom(1,2^32,0.95)
#[1] NA
set.seed(42)
qbinom(runif(1), 2^32, 0.95, F)
#[1] 4080199349

As @BenBolker points out rbinom returns an integer and if the return value is larger than .Machine$integer.max, e.g., larger than 2147483647 on my machine, NA gets returned. In contrast qbinom returns a double. I don't know why and it doesn't seem to be documented.

So, it seems like there is at least undocumented behavior and you should probably report it.

Upvotes: 5

Related Questions