R_Sample with probabilities

Question

I am having trouble understanding the prob argument of sample(). For example, I want to create a sample data set of size 100 from the integers 1, 2, 3 and 4, using probabilities of 0.1, 0.2, 0.3 and 0.4 respectively.

sample1<-sample(1:4,100,replace=T,prob=seq(0.1,0.4,0.1))

So I expect a sample in which the integers 1, 2, 3 and 4 appear 10, 20, 30 and 40 times respectively. But the result is different:

> table(sample1)
sample1
 1  2  3  4 
 7 24 33 36

Can anyone explain this? And what should I do if I want to get the expected result, which is

> table(sample1)
sample1
 1  2  3  4 
10 20 30 40

jlhoward · Accepted Answer

sample(...) takes a random sample with probabilities given in prob=..., so you will not get exactly that proportion every time. On the other hand, the proportions get closer to those specified in prob as n increases:

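# Draw n values from 1:4 with replacement, using probabilities 0.1, 0.2, 0.3, 0.4
f <- function(n) sample(1:4, n, replace = TRUE, prob = (1:4) / 10)
samples <- lapply(10^(2:6), f)
# One row per sample size: n, then the observed proportion of each value
t(sapply(samples, function(x) c(n = length(x), table(x) / length(x))))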
#          n        1        2        3        4
# [1,] 1e+02 0.090000 0.220000 0.260000 0.430000
# [2,] 1e+03 0.076000 0.191000 0.309000 0.424000
# [3,] 1e+04 0.095300 0.200200 0.310100 0.394400
# [4,] 1e+05 0.099720 0.199800 0.302250 0.398230
# [5,] 1e+06 0.099661 0.199995 0.300223 0.400121
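To put a number on how far the counts can wander for n = 100, here is a minimal sketch. It relies on the standard result that each count in such a sample is binomial with mean n*p and standard deviation sqrt(n*p*(1-p)); the variable names and the seed are illustrative choices, not part of the original answer. For the value 1 the standard deviation is 3, so a count of 7 instead of 10 is ordinary sampling noise.

```r
# Sketch: expected counts and their typical spread for n = 100 draws.
n <- 100
p <- (1:4) / 10

expected <- n * p                  # 10 20 30 40
spread   <- sqrt(n * p * (1 - p))  # standard deviation of each count
round(rbind(expected, spread), 1)

# rmultinom() draws the four counts in one call; it has the same
# distribution as table(sample(1:4, n, replace = TRUE, prob = p)).
set.seed(1)  # arbitrary seed, only so the run is repeatable
counts <- rmultinom(1, size = n, prob = p)[, 1]
counts
sum(counts)  # the counts always add up to n
```

The counts change from run to run, but they always sum to n and usually fall within a couple of standard deviations of the expected values.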

If you need a random sample with exactly those proportions, use rep(...) and randomize the order.

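# Build a vector with exactly n*p copies of each value, then shuffle it.
# Assumes n * (1:4) / 10 are whole numbers, i.e. n is a multiple of 10.
g <- function(n) sample(rep(1:4, times = n * (1:4) / 10))
samples <- lapply(10^(2:6), g)
t(sapply(samples, function(x) c(n = length(x), table(x) / length(x))))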
#          n   1   2   3   4
# [1,] 1e+02 0.1 0.2 0.3 0.4
# [2,] 1e+03 0.1 0.2 0.3 0.4
# [3,] 1e+04 0.1 0.2 0.3 0.4
# [4,] 1e+05 0.1 0.2 0.3 0.4
# [5,] 1e+06 0.1 0.2 0.3 0.4
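Applied to the exact case in the question, the same idea can be sketched as follows. The round() call is an added safeguard, not part of the original answer: rep() truncates non-integer counts, so a product such as 100 * 0.3 that lands slightly off a whole number in floating point could otherwise lose an element.

```r
# Sketch: exactly 10, 20, 30 and 40 copies of 1, 2, 3 and 4 in random order.
n <- 100
p <- seq(0.1, 0.4, 0.1)

# round() guards against floating-point error in n * p before rep() truncates.
sample1 <- sample(rep(1:4, times = round(n * p)))

table(sample1)
# sample1
#  1  2  3  4
# 10 20 30 40
```

This only gives exact proportions when n * p are whole numbers; for other sample sizes the counts have to be rounded and will no longer match prob exactly.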
