user1700890

Reputation: 7730

Conditional probability experiment in R

Here is my code:

library(dplyr)

rain_vector <- sample(c(0,1), 1000000, replace = T, prob= c(0.2,0.8))

for(el in 1:10){

df <- data.frame(rain = rain_vector )
df <- df %>% mutate(A= if_else(rain == 1, sample(c(0,1), 1, replace = T, prob= c(1/3,2/3)),
                          sample(c(0,1), 1, replace = T, prob= c(2/3,1/3))))

print(NROW(df[df$A==1,]))
print(NROW(df[df$A == 1 & df$rain == 1, ]))
print(NROW(df[df$rain == 1,]))
print("______________")

}

Here is the output:

[1] 0
[1] 0
[1] 800325
[1] "______________"
[1] 800325
[1] 800325
[1] 800325
[1] "______________"
[1] 800325
[1] 800325
[1] 800325
[1] "______________"
[1] 800325
[1] 800325
[1] 800325
[1] "______________"
[1] 800325
[1] 800325
[1] 800325
[1] "______________"
[1] 800325
[1] 800325
[1] 800325
[1] "______________"
[1] 800325
[1] 800325
[1] 800325
[1] "______________"
[1] 1000000
[1] 800325
[1] 800325
[1] "______________"
[1] 1000000
[1] 800325
[1] 800325
[1] "______________"
[1] 1000000
[1] 800325
[1] 800325
[1] "______________"

None of the results make sense to me. Take the last one: A = 1 occurs in every row, whereas it should occur with probability 2/3 when rain == 1 and 1/3 when rain == 0. Is this something specific to the dplyr package? Any suggestions?

Upvotes: 0

Views: 42

Answers (1)

VFreguglia

Reputation: 2311

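The problem is that sample(c(0,1), 1, replace = T, prob= c(1/3,2/3)) returns a vector of length 1. Each of the two sample() calls is evaluated once per mutate() call, not once per row, and if_else() recycles that single value across every matching row. That is why, within one iteration, all rows with rain == 1 share one value of A and all rows with rain == 0 share another.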
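The recycling can be seen on a small vector. This is only a sketch: it uses base R's ifelse() as a stand-in for dplyr's if_else() (an assumption made to keep the example dependency-free; both recycle length-one inputs), and the vector rain and the seed are made up for illustration.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Hypothetical toy data: 1 = rain, 0 = no rain
rain <- c(1, 0, 1, 1, 0, 1)

# Each sample() call runs once and returns a length-one vector,
# which ifelse() then recycles across all matching elements.
A <- ifelse(rain == 1,
            sample(c(0, 1), 1, prob = c(1/3, 2/3)),
            sample(c(0, 1), 1, prob = c(2/3, 1/3)))

# Number of distinct values of A within each rain group
distinct_rain <- length(unique(A[rain == 1]))
distinct_dry  <- length(unique(A[rain == 0]))
print(c(distinct_rain, distinct_dry))
```

Whatever the seed, each group contains exactly one distinct value of A, which matches the all-or-nothing counts in the question's output.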

One fix is to call rowwise() before mutate(), which tells dplyr to evaluate the expression, and therefore sample(), once per row.

  df <- df %>% 
    rowwise() %>%
    mutate(A= if_else(rain == 1, sample(c(0,1), 1, replace = T, prob= c(1/3,2/3)),
                                 sample(c(0,1), 1, replace = T, prob= c(2/3,1/3))))

A faster option is to use base R and draw a single sample of the appropriate length for each of the two values of rain, reducing the number of calls to sample() from 1000000 to 2:

rain_vector <- sample(c(0,1), 1000000, replace = T, prob= c(0.2,0.8))

for(el in 1:10){

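  # Start with A = 0 everywhere, then fill each rain group with one vectorised draw
  df <- data.frame(rain = rain_vector, A = numeric(length(rain_vector)))
  # Rows with rain == 1: A = 1 with probability 2/3
  df[rain_vector == 1, "A"] <- sample(c(0,1), sum(rain_vector==1), prob= c(1/3,2/3), replace = T)
  # Rows with rain == 0: A = 1 with probability 1/3
  df[rain_vector == 0, "A"] <- sample(c(0,1), sum(rain_vector==0), prob= c(2/3,1/3), replace = T)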

  print(NROW(df[df$A==1,]))
  print(NROW(df[df$A == 1 & df$rain == 1, ]))
  print(NROW(df[df$rain == 1,]))
  print("______________")

}
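As a further sketch (not part of the code above), the same draw can be written as a single call to rbinom(), which is vectorised over its prob argument. The names p_A, p_A_given_rain, p_A_given_dry and p_A_overall, and the seed, are illustrative assumptions. The empirical proportions can then be compared with the intended conditional probabilities, and with the overall value 0.8 * 2/3 + 0.2 * 1/3 = 0.6 from the law of total probability.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

n <- 1000000
rain <- sample(c(0, 1), n, replace = TRUE, prob = c(0.2, 0.8))

# Per-row success probability: 2/3 when it rains, 1/3 otherwise
p_A <- ifelse(rain == 1, 2/3, 1/3)

# One Bernoulli draw per row, each with its own probability
A <- rbinom(n, size = 1, prob = p_A)

# Empirical estimates of P(A = 1 | rain = 1), P(A = 1 | rain = 0) and P(A = 1)
p_A_given_rain <- mean(A[rain == 1])
p_A_given_dry  <- mean(A[rain == 0])
p_A_overall    <- mean(A)

print(c(p_A_given_rain, p_A_given_dry, p_A_overall))
```

With a million rows the three estimates should land close to 2/3, 1/3 and 0.6 respectively, though the exact digits depend on the seed.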

Upvotes: 1
