Reputation: 187
Good day everyone
Take this dataset:
library(tibble)

df <- tibble(
  id = 1:1000,
  smoking = sample(c(TRUE, FALSE), length(id), prob = c(0.2, 0.8), replace = TRUE),
  age = rnorm(length(id), mean = 60, sd = 10)
)
I want to add another logical variable, lung_cancer, to the data frame, where TRUE or FALSE is assigned with a probability that depends on the patient's smoking status and age.
I understand that this requires looping over each index, and I can manage it with a for loop, so I wrote the following:
library(dplyr)  # if_else(), case_when()

# id only exists inside tibble(), so size the new column with nrow(df)
df$lung_cancer <- vector("logical", nrow(df))
for (i in seq_along(df$lung_cancer)) {
  df$lung_cancer[[i]] <- if_else(
    df$age[[i]] > 50,
    case_when(
      df$age[[i]] > 50 & df$smoking[[i]] == TRUE ~ sample(c(TRUE, FALSE), 1, prob = c(0.05, 0.95)),
      df$age[[i]] > 50 & df$smoking[[i]] == FALSE ~ sample(c(TRUE, FALSE), 1, prob = c(0.001, 0.999))
    ),
    FALSE
  )
}
I find this too verbose. Is there a more concise way to write it with mutate() and the purrr package, or in any other way (preferably from the tidyverse collection of packages)?
Upvotes: 1
Views: 193
Reputation: 2414
data.table allows you to update a subset of a column's rows. This way sample() is called only twice, once per group, instead of 1000 times.
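library(data.table)
library(magrittr)  # provides %>%, which data.table does not export

set.seed(42)
n <- 1000  # data.table() cannot refer to the id column while it is being built
df <- data.table(
  id = seq_len(n),
  smoking = sample(c(TRUE, FALSE), n, prob = c(0.2, 0.8), replace = TRUE),
  age = rnorm(n, mean = 60, sd = 10)
) %>%
  .[, lung_cancer := FALSE] %>%
  .[age > 50 & smoking, lung_cancer := sample(c(TRUE, FALSE), .N, prob = c(0.05, 0.95), replace = TRUE)] %>%
  .[age > 50 & !smoking, lung_cancer := sample(c(TRUE, FALSE), .N, prob = c(0.001, 0.999), replace = TRUE)] %>%
  .[]

# report: group size and number of cases by smoking status
df[, .(.N, lc = sum(lung_cancer)), keyby = smoking]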
df[, .(.N, lc = sum(lung_cancer)), keyby = smoking]
smoking N lc
1: FALSE 804 2
2: TRUE 196 5
I put a "report" at the end. (If necessary, you can convert your tibble to a data.table with setDT() instead.)
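A minimal sketch of the setDT() route, assuming the data starts out as the tibble from the question. The variable n_rows is introduced here only for illustration; the update steps are the same as above, written without a pipe.

```r
library(data.table)
library(tibble)

set.seed(42)
n_rows <- 1000  # illustrative name, not from the original post
df <- tibble(
  id = seq_len(n_rows),
  smoking = sample(c(TRUE, FALSE), n_rows, prob = c(0.2, 0.8), replace = TRUE),
  age = rnorm(n_rows, mean = 60, sd = 10)
)

# setDT() converts the existing object to a data.table by reference (no copy)
setDT(df)

# default for everyone, then overwrite only the rows in each risk group
df[, lung_cancer := FALSE]
df[age > 50 & smoking,
   lung_cancer := sample(c(TRUE, FALSE), .N, prob = c(0.05, 0.95), replace = TRUE)]
df[age > 50 & !smoking,
   lung_cancer := sample(c(TRUE, FALSE), .N, prob = c(0.001, 0.999), replace = TRUE)]
```

Because := modifies the table in place, no reassignment of df is needed after each step.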
Upvotes: 0
Reputation: 24149
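The case_when() function should be all you need, but it does not re-evaluate the right-hand side for each matching row: a sample(..., 1) call is evaluated once and its single result is recycled to every row where the condition is TRUE.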
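A small sketch of that behaviour, assuming dplyr is installed. The vector flag and the 50/50 probabilities are made up for the demonstration and are not part of the question.

```r
library(dplyr)

set.seed(1)
flag <- rep(TRUE, 20)  # hypothetical input: 20 rows that all match

# sample(..., 1) is evaluated once; its single result is recycled to all 20 rows
recycled <- case_when(
  flag ~ sample(c(TRUE, FALSE), 1, prob = c(0.5, 0.5))
)

# asking for one draw per row gives a full-length vector of independent draws
per_row <- case_when(
  flag ~ sample(c(TRUE, FALSE), length(flag), prob = c(0.5, 0.5), replace = TRUE)
)

length(unique(recycled))  # 1: every row received the same draw
```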
Here is a simple base R solution taking advantage of R's vectorization ability (thus avoiding the loop).
#set all to the default value
df$lung_cancer<-FALSE
#perform the selections and then set to new value
df$lung_cancer[df$age > 50 & df$smoking == TRUE ] <- sample(c(TRUE, FALSE), nrow(df), prob = c(0.05, 0.95), replace = TRUE)[df$age > 50 & df$smoking == TRUE ]
df$lung_cancer[df$age > 50 & df$smoking == FALSE] <- sample(c(TRUE, FALSE), nrow(df), prob = c(0.001, 0.999), replace = TRUE)[df$age > 50 & df$smoking == FALSE]
case_when() question
To define a default value with case_when(), add a TRUE ~ value clause as the last condition, as in this example:
case_when(
  df$age > 50 & df$smoking == TRUE  ~ "Group1",
  df$age > 50 & df$smoking == FALSE ~ "Group2",
  TRUE ~ "Everyone Else"
)
See ?case_when for more examples.
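Putting the default clause to work on the original question, here is one possible mutate() sketch. It swaps the sample() calls for a different technique: case_when() assigns each row a probability, and a single vectorized runif() draw per row is compared against it. The names n_rows and p_cancer are illustrative and not from the posts above.

```r
library(dplyr)
library(tibble)

set.seed(42)
n_rows <- 1000  # illustrative name
df <- tibble(
  id = seq_len(n_rows),
  smoking = sample(c(TRUE, FALSE), n_rows, prob = c(0.2, 0.8), replace = TRUE),
  age = rnorm(n_rows, mean = 60, sd = 10)
)

df <- df %>%
  mutate(
    # per-row probability of lung cancer; TRUE ~ 0 is the default clause
    p_cancer = case_when(
      age > 50 & smoking  ~ 0.05,
      age > 50 & !smoking ~ 0.001,
      TRUE                ~ 0
    ),
    # one uniform draw per row; TRUE with probability p_cancer
    lung_cancer = runif(n()) < p_cancer
  )
```

Since runif() returns values strictly between 0 and 1, rows with p_cancer equal to 0 are always FALSE, which matches the loop's behaviour for patients aged 50 or under.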
Upvotes: 1