Reputation: 1362
My data looks like this:
d
#> # A tibble: 220 x 2
#> smoker pain
#> <chr> <chr>
#> 1 Smoker Pain
#> 2 Smoker Pain
#> 3 Smoker Pain
#> 4 Smoker Pain
#> 5 Smoker Pain
#> 6 Smoker Pain
#> 7 Smoker Pain
#> 8 Smoker Pain
#> 9 Smoker Pain
#> 10 Smoker Pain
#> # … with 210 more rows
It is a combination of two variables: smoker and pain.
d %>%
  count(smoker, pain, sort = TRUE)
#> # A tibble: 4 x 3
#> smoker pain n
#> <chr> <chr> <int>
#> 1 No smoker No pain 107
#> 2 Smoker Pain 70
#> 3 Smoker No pain 35
#> 4 No smoker Pain 8
I want to calculate the probability that a person feels pain given that they are a smoker, P(pain | smoker):
library(tidyverse)
library(prob)
d <- probspace(d)
Prob(d, event = smoker == "Smoker", given = pain == "Pain")
#> [1] 0.01282051
As far as I know, this value should be the proportion of smokers who feel pain:
70/105
#> [1] 0.667
What is wrong here?
This is the code for the data:
smoker <- c(rep("Smoker", 105), rep("No smoker", 115))
pain <- c(rep("Pain", 70), rep("No pain", 35), rep("Pain", 8), rep("No pain", 107))
d <- tibble(smoker, pain)
Upvotes: 2
Views: 1293
Reputation: 101335
I think you should add one more line, d <- cbind(id = seq(nrow(d)), d), after d <- tibble(smoker, pain), i.e.,
d <- tibble(smoker, pain)
d <- cbind(id = seq(nrow(d)),d)
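Then, after rebuilding the probability space with d <- probspace(d) as in the question, you will get the desired result. Note that event and given are also swapped relative to the question's call, so that the call computes P(pain | smoker) rather than P(smoker | pain):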
> Prob(d, event = pain == "Pain", given = smoker == "Smoker")
[1] 0.6666667
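As a hand check of that number, the definition of conditional probability applied to the counts in the question gives the same value. The second line is only an observation, not something stated by the package documentation: the 0.01282051 reported in the question equals 1/78, which is what one would expect if the intersection had collapsed to a single row of probability 1/220 and was then divided by P(pain) = 78/220.

```latex
\begin{aligned}
P(\text{pain} \mid \text{smoker})
  &= \frac{P(\text{pain} \cap \text{smoker})}{P(\text{smoker})}
   = \frac{70/220}{105/220}
   = \frac{70}{105} \approx 0.6667 \\[4pt]
\frac{1/220}{78/220}
  &= \frac{1}{78} \approx 0.01282
\end{aligned}
```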
NOTE: The reason for doing this is that Prob() calculates the intersect() between the event and the given condition. When you use a data frame as the probability space, duplicate rows in the intersection are dropped. To avoid that, you need to add extra information manually (here, an id column) that distinguishes the rows of the data frame d, so that all duplicates are kept until the end of the calculation.
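The effect described in the note can be illustrated without the prob package. The following is a minimal base R sketch, not a reproduction of what Prob() does internally: it uses unique() as a stand-in for set-style deduplication, and the names d_id and p_direct are introduced here for illustration only.

```r
# Rebuild the data from the question as a plain data frame
smoker <- c(rep("Smoker", 105), rep("No smoker", 115))
pain <- c(rep("Pain", 70), rep("No pain", 35), rep("Pain", 8), rep("No pain", 107))
d <- data.frame(smoker, pain)

# Treated as a set, the 220 rows collapse to the 4 distinct combinations
nrow(unique(d))

# With an id column every row is distinct, so nothing is dropped
d_id <- cbind(id = seq_len(nrow(d)), d)  # d_id: illustrative name
nrow(unique(d_id))

# Direct check of P(pain | smoker): share of smoker rows that report pain
p_direct <- mean(d$pain[d$smoker == "Smoker"] == "Pain")
p_direct
```

The last value is 70/105, which should agree with the Prob() result once the id column is in place.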
Upvotes: 1