Reputation: 1247
I am having trouble understanding the prob argument of sample. For example, I want to create a sample data set of size 100 from the integers 1, 2, 3 and 4, using probabilities of 0.1, 0.2, 0.3 and 0.4 respectively.
sample1<-sample(1:4,100,replace=T,prob=seq(0.1,0.4,0.1))
I expected a sample in which the integers 1, 2, 3 and 4 appear exactly 10, 20, 30 and 40 times respectively, but the result is different:
> table(sample1)
sample1
 1  2  3  4
 7 24 33 36
Can anyone explain this? And what should I do to get the expected result, which is:
> table(sample1)
sample1
 1  2  3  4
10 20 30 40
Upvotes: 0
Views: 196
Reputation: 59385
sample(...) takes a random sample in which each draw follows the probabilities given in prob=..., so you will not get exactly those proportions every time. On the other hand, the observed proportions get closer to those specified in prob as n increases:
f <- function(n) sample(1:4, n, replace = TRUE, prob = (1:4) / 10)
samples <- lapply(10^(2:6),f)
t(sapply(samples,function(x)c(n=length(x),table(x)/length(x))))
#          n        1        2        3        4
# [1,] 1e+02 0.090000 0.220000 0.260000 0.430000
# [2,] 1e+03 0.076000 0.191000 0.309000 0.424000
# [3,] 1e+04 0.095300 0.200200 0.310100 0.394400
# [4,] 1e+05 0.099720 0.199800 0.302250 0.398230
# [5,] 1e+06 0.099661 0.199995 0.300223 0.400121
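How much scatter to expect at n = 100 can be estimated directly. The following is a minimal sketch, assuming the draws are independent (which is the case with replace = TRUE), so that each count is binomial with mean n*p and standard deviation sqrt(n*p*(1-p)). The variable names are only illustrative.

```r
# Expected counts and their standard deviations for n independent draws.
n <- 100
p <- (1:4) / 10

expected <- n * p                   # mean count for each value
std_dev  <- sqrt(n * p * (1 - p))   # binomial standard deviation of each count

round(rbind(expected, std_dev), 2)

# Distance of the table from the question (7, 24, 33, 36)
# from the expected counts, in units of standard deviations.
observed <- c(7, 24, 33, 36)
z <- (observed - expected) / std_dev
round(z, 2)
```

Each observed count in the question lies within about one standard deviation of its expected value, so a table like 7, 24, 33, 36 is an ordinary outcome rather than a sign that prob is being ignored.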
If you need a random sample with exactly those proportions, use rep(...) and randomize the order.
g <- function(n) rep(1:4,n*(1:4)/10)[sample(1:n,n)]
samples <- lapply(10^(2:6),g)
t(sapply(samples,function(x)c(n=length(x),table(x)/length(x))))
#          n   1   2   3   4
# [1,] 1e+02 0.1 0.2 0.3 0.4
# [2,] 1e+03 0.1 0.2 0.3 0.4
# [3,] 1e+04 0.1 0.2 0.3 0.4
# [4,] 1e+05 0.1 0.2 0.3 0.4
# [5,] 1e+06 0.1 0.2 0.3 0.4
Upvotes: 1
Reputation: 61953
sample takes a sample with the specified probabilities. That implies randomness: you won't get the same result every time. To get exactly the counts you want, just use rep:
rep(1:4, 100*seq(0.1,0.4,0.1))
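Two details may matter in practice: rep returns the values in sorted order, and it truncates non-integer times towards zero, so a product such as 100 * 0.29 (which evaluates to 28.999999999999996 in floating point) would silently lose an element. Below is a hedged sketch of a small wrapper that rounds the counts and shuffles the result. exact_sample is a hypothetical helper name, not a base R function, and it assumes n * prob is meant to be a vector of whole numbers.

```r
# Hypothetical helper (not part of base R): exact counts in random order.
exact_sample <- function(values, prob, n) {
  counts <- round(n * prob)             # round() guards against floating-point truncation in rep()
  stopifnot(sum(counts) == n)           # fail loudly if the rounded counts do not add up to n
  pool <- rep(values, times = counts)   # e.g. 10 ones, 20 twos, 30 threes, 40 fours
  pool[sample.int(n)]                   # shuffle via a random permutation of positions
}

sample1 <- exact_sample(1:4, seq(0.1, 0.4, 0.1), 100)
table(sample1)
```

The order of sample1 changes from run to run, but the table of counts is always 10, 20, 30 and 40.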
Upvotes: 2