Reputation: 7730
Here is my code:
library(dplyr)
rain_vector <- sample(c(0,1), 1000000, replace = T, prob= c(0.2,0.8))
for(el in 1:10){
  df <- data.frame(rain = rain_vector )
  df <- df %>% mutate(A= if_else(rain == 1, sample(c(0,1), 1, replace = T, prob= c(1/3,2/3)),
                                 sample(c(0,1), 1, replace = T, prob= c(2/3,1/3))))
  print(NROW(df[df$A==1,]))
  print(NROW(df[df$A == 1 & df$rain == 1, ]))
  print(NROW(df[df$rain == 1,]))
  print("______________")
}
Here is the output:
[1] 0
[1] 0
[1] 800325
[1] "______________"
[1] 800325
[1] 800325
[1] 800325
[1] "______________"
[1] 800325
[1] 800325
[1] 800325
[1] "______________"
[1] 800325
[1] 800325
[1] 800325
[1] "______________"
[1] 800325
[1] 800325
[1] 800325
[1] "______________"
[1] 800325
[1] 800325
[1] 800325
[1] "______________"
[1] 800325
[1] 800325
[1] 800325
[1] "______________"
[1] 1000000
[1] 800325
[1] 800325
[1] "______________"
[1] 1000000
[1] 800325
[1] 800325
[1] "______________"
[1] 1000000
[1] 800325
[1] 800325
[1] "______________"
None of the results make sense to me. Take the last one: A = 1 happens for every row, while it is supposed to happen with probability 1/3 or 2/3 depending on rain. Is this something about the dplyr package? Any suggestions?
Upvotes: 0
Views: 42
Reputation: 2311
The problem is that sample(c(0,1), 1, replace = T, prob= c(1/3,2/3)) has length 1, so mutate() evaluates it once and if_else() recycles that single value across every row where the condition matches. Instead, you could use rowwise() before your mutate() call, which explicitly tells dplyr to call sample() once per row:
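# assumes library(dplyr) is loaded and df is defined as in the question
df <- df %>%
  rowwise() %>%   # evaluate the mutate() expression once per row
  mutate(A = if_else(rain == 1,
                     sample(c(0,1), 1, replace = TRUE, prob = c(1/3,2/3)),
                     sample(c(0,1), 1, replace = TRUE, prob = c(2/3,1/3)))) %>%
  ungroup()       # drop the rowwise grouping once A is computed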
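To see the recycling for yourself, here is a minimal sketch on a smaller data frame. The names small, recycled, per_row, distinct_per_group and share_by_rain, the seed, and the size of 5000 rows per group are my own choices for illustration, not part of your code.

```r
library(dplyr)

set.seed(1)  # arbitrary seed, only to make the sketch reproducible

# Hypothetical small input: 5000 dry rows and 5000 rainy rows
small <- data.frame(rain = rep(c(0, 1), each = 5000))

# Without rowwise(): each sample() call runs once, and if_else() recycles
# its single value, so A has exactly one distinct value per rain group.
recycled <- small %>%
  mutate(A = if_else(rain == 1,
                     sample(c(0, 1), 1, prob = c(1/3, 2/3)),
                     sample(c(0, 1), 1, prob = c(2/3, 1/3))))

distinct_per_group <- recycled %>%
  group_by(rain) %>%
  summarise(n_values = n_distinct(A))
print(distinct_per_group)

# With rowwise(): the expression is evaluated again for every row,
# so each row gets its own draw.
per_row <- small %>%
  rowwise() %>%
  mutate(A = if_else(rain == 1,
                     sample(c(0, 1), 1, prob = c(1/3, 2/3)),
                     sample(c(0, 1), 1, prob = c(2/3, 1/3)))) %>%
  ungroup()

share_by_rain <- per_row %>%
  group_by(rain) %>%
  summarise(share_A1 = mean(A))
print(share_by_rain)
```

In the first summary every group should show a single distinct value, which is what produces the all-or-nothing counts in your output. In the second, the share of A = 1 should land near 1/3 for rain = 0 and near 2/3 for rain = 1.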
Another, faster option is to use base R and draw a single sample of the appropriate length for each of the two values of rain, reducing the number of calls to sample() from 1000000 to 2:
rain_vector <- sample(c(0,1), 1000000, replace = TRUE, prob = c(0.2,0.8))
for(el in 1:10){
  # start with A = 0 everywhere, then fill in each rain group with one sample() call
  df <- data.frame(rain = rain_vector, A = numeric(length(rain_vector)))
  # rainy rows: A = 1 with probability 2/3
  df[rain_vector == 1, "A"] <- sample(c(0,1), sum(rain_vector==1), prob = c(1/3,2/3), replace = TRUE)
  # dry rows: A = 1 with probability 1/3
  df[rain_vector == 0, "A"] <- sample(c(0,1), sum(rain_vector==0), prob = c(2/3,1/3), replace = TRUE)
  print(NROW(df[df$A==1,]))
  print(NROW(df[df$A == 1 & df$rain == 1, ]))
  print(NROW(df[df$rain == 1,]))
  print("______________")
}
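If you want to go one step further, a single vectorized draw should also work, since rbinom() accepts a vector of probabilities. This is a different technique from the two sample() calls above, sketched under my own assumptions: the names n, rain, p_A and share_A1, the seed, and the reduced size of 100000 rows are only for illustration.

```r
set.seed(42)  # arbitrary seed, only to make the sketch reproducible

n <- 100000   # smaller than the 1000000 in the question, to keep the sketch quick
rain <- sample(c(0, 1), n, replace = TRUE, prob = c(0.2, 0.8))

# Per-row probability that A = 1: 2/3 when it rains, 1/3 otherwise
p_A <- ifelse(rain == 1, 2/3, 1/3)

# One Bernoulli draw per row, each with its own success probability
A <- rbinom(n, size = 1, prob = p_A)

df <- data.frame(rain = rain, A = A)

# Share of A = 1 within each rain group (names are "0" and "1")
share_A1 <- tapply(df$A, df$rain, mean)
print(share_A1)
```

The shares should come out close to 1/3 for rain = 0 and 2/3 for rain = 1, and there is no per-row function call at all.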
Upvotes: 1