Reputation: 16184
I've been writing some code that iteratively performs binomial draws (using rbinom
) and for some callee arguments I can end up with the size being large, which causes R (3.1.1, both official or homebrew builds tested—so unlikely to be compiler related) to return an unexpected NA
. For example:
rbinom(1,2^32,0.95)
is what I'd expect to work, but gives NA
back. However, running with size=2^31
or prob≤0.5
works.
The fine manual mentions inversion being used when size < .Machine$integer.max
is false, could this be the issue?
Upvotes: 5
Views: 1373
Reputation: 226332
It appears that, despite Martyn Plummer saying in his answer (from 2014) that returning an integer value (and NA
-with-a-warning when the value overflowed) was the intended behaviour, this is no longer true in at least the development version of R (4.5.0/SVN r86945).
set.seed(101); str(rbinom(1,1000,0.95))
## int 942
set.seed(101); str(rbinom(1,2^50,0.95))
## num 1.07e+15
Upvotes: 1
Reputation: 226332
I agree that (in the absence of documentation saying this is a problem) that this is a bug. A reasonable workaround would be using the Normal approximation, which should be very very good indeed (and faster) for such large values. (I originally meant this to be short and simple but it ended up getting a little bit out of hand.)
rbinom_safe <- function(n,size,prob,max.size=2^31) {
maxlen <- max(length(size),length(prob),n)
prob <- rep(prob,length.out=maxlen)
size <- rep(size,length.out=maxlen)
res <- numeric(n)
bigvals <- size>max.size
if (nbig <- sum(bigvals>0)) {
m <- (size*prob)[bigvals]
sd <- sqrt(size*prob*(1-prob))[bigvals]
res[bigvals] <- round(rnorm(nbig,mean=m,sd=sd))
}
if (nbig<n) {
res[!bigvals] <- rbinom(n-nbig,size[!bigvals],prob[!bigvals])
}
return(res)
}
set.seed(101)
size <- c(1,5,10,2^31,2^32)
rbinom_safe(5,size,prob=0.95)
rbinom_safe(5,3,prob=0.95)
rbinom_safe(5,2^32,prob=0.95)
The Normal approximation should work reasonably well whenever the mean is many standard deviations away from 0 or 1 (whichever is closer). For large N this should be OK unless p is very extreme. For example:
n <- 2^31
p <- 0.95
m <- n*p
sd <- sqrt(n*p*(1-p))
set.seed(101)
rr <- rbinom_safe(10000,n,prob=p)
hist(rr,freq=FALSE,col="gray",breaks=50)
curve(dnorm(x,mean=m,sd=sd),col=2,add=TRUE)
dd <- round(seq(m-5*sd,m+5*sd,length.out=101))
midpts <- (dd[-1]+dd[-length(dd)])/2
lines(midpts,c(diff(sapply(dd,pbinom,size=n,prob=p))/diff(dd)[1]),
col="blue",lty=2)
Upvotes: 3
Reputation: 341
This is the intended behaviour, but there are two issues: 1) The NA induced by coercion should raise a warning 2) The fact that discrete random variables have storage mode integer should be documented.
I have fixed 1) and will modify the documentation to fix 2) when I have a little more time.
Upvotes: 3
Reputation: 132706
Looking at the source rbinom
does the equivalent (in C code) of the following for such large sizes:
qbinom(runif(n), size, prob, FALSE)
And indeed:
set.seed(42)
rbinom(1,2^31,0.95)
#[1] 2040095619
set.seed(42)
qbinom(runif(1), 2^31, 0.95, F)
#[1] 2040095619
However:
set.seed(42)
rbinom(1,2^32,0.95)
#[1] NA
set.seed(42)
qbinom(runif(1), 2^32, 0.95, F)
#[1] 4080199349
As @BenBolker points out rbinom
returns an integer and if the return value is larger than .Machine$integer.max
, e.g., larger than 2147483647
on my machine, NA
gets returned. In contrast qbinom
returns a double. I don't know why and it doesn't seem to be documented.
So, it seems like there is at least undocumented behavior and you should probably report it.
Upvotes: 5