Papa Analytica

Reputation: 187

Looping over indexes of specific column(s)

Good day everyone

Take this dataset:

df <- tibble(
  id = 1:1000,
  smoking = sample(c(TRUE, FALSE), length(id), prob = c(0.2,0.8), replace = TRUE),
  age = rnorm(length(id), mean = 60, sd = 10)
)

I want to add another logical variable, lung_cancer, to the data frame, where TRUE or FALSE is assigned at random with a probability that depends on the patient's smoking status and age.

I understand that this requires looping over each index, and I can do it with a for loop, so I wrote the following:

df$lung_cancer <- vector("logical", nrow(df))
for (i in seq_along(df$lung_cancer)) {
  df$lung_cancer[[i]] = if_else(df$age[[i]] > 50, case_when(
      df$age[[i]] > 50 & df$smoking[[i]] == TRUE ~ sample(c(TRUE, FALSE), 1, prob = c(0.05, 0.95)),
      df$age[[i]] > 50 & df$smoking[[i]] == FALSE ~ sample(c(TRUE, FALSE), 1, prob = c(0.001, 0.999))
    ), FALSE
  )
}

I find this too verbose. Is there a more concise way to write it with mutate() and the purrr package, or any other way (preferably from the tidyverse)?

Upvotes: 1

Views: 193

Answers (2)

Brian Montgomery

Reputation: 2414

data.table lets you assign to a subset of a column's rows. This way sample() is called only twice instead of 1000 times.

library(data.table)
library(magrittr)  # provides the %>% pipe used below
set.seed(42)
df <- data.table(
  id = 1:1000,
  smoking = sample(c(TRUE, FALSE), length(id), prob = c(0.2,0.8), replace = TRUE),
  age = rnorm(length(id), mean = 60, sd = 10)
) %>% 
  .[, lung_cancer := FALSE] %>% 
  .[age > 50 & smoking, lung_cancer := sample(c(TRUE, FALSE), .N, prob = c(0.05, 0.95), replace = TRUE)] %>% 
  .[age > 50 & !smoking, lung_cancer := sample(c(TRUE, FALSE), .N, prob = c(0.001, 0.999), replace = TRUE)] %>% 
  .[]
  
df[, .(.N, lc = sum(lung_cancer)), keyby = smoking]
   smoking   N lc
1:   FALSE 804  2
2:    TRUE 196  5

I put a "report" at the end.
(You can convert your tibble to a data.table with setDT() instead, if necessary)
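A minimal sketch of the setDT() route, assuming data.table is installed. The five-row data frame and the seed are made up purely for illustration; setDT() converts the object in place, so no reassignment is needed.

```r
library(data.table)

set.seed(42)  # arbitrary seed, only for reproducibility

# Hypothetical toy data standing in for the question's tibble
df <- data.frame(
  id = 1:5,
  smoking = c(TRUE, TRUE, FALSE, FALSE, TRUE),
  age = c(40, 55, 60, 45, 70)
)

setDT(df)  # converts df to a data.table by reference

# Default everyone to FALSE, then overwrite only the matching rows;
# .N is the number of rows selected by the filter in i
df[, lung_cancer := FALSE]
df[age > 50 & smoking,
   lung_cancer := sample(c(TRUE, FALSE), .N, prob = c(0.05, 0.95), replace = TRUE)]
df[age > 50 & !smoking,
   lung_cancer := sample(c(TRUE, FALSE), .N, prob = c(0.001, 0.999), replace = TRUE)]
```

Rows with age <= 50 are never touched by the two filtered assignments, so they keep the FALSE default.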

Upvotes: 0

Dave2e

Reputation: 24149

case_when() should be all you need, but it does not re-evaluate sample() for each matching row: each right-hand side is evaluated once, and a length-1 result is recycled across every row that matches.
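A minimal sketch of that behavior, assuming dplyr is installed (the vector length of 10 and the seed are arbitrary choices for illustration). With a single sample() call on the right-hand side, one value is drawn and reused for all matching rows.

```r
library(dplyr)

set.seed(1)  # arbitrary seed, only for reproducibility

# Ten rows that all satisfy the condition; the right-hand side draws ONE value
x <- case_when(
  rep(TRUE, 10) ~ sample(c(TRUE, FALSE), 1, prob = c(0.5, 0.5))
)

length(x)          # 10 values are returned...
length(unique(x))  # ...but they are all the same single draw
```

To get an independent draw per row, the right-hand side has to be a vector with one element per row.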

Here is a simple base R solution taking advantage of R's vectorization ability (thus avoiding the loop).

# set all rows to the default value
df$lung_cancer <- FALSE

# build each row selection once, then draw one sample per selected row
smoker_over_50    <- df$age > 50 & df$smoking
nonsmoker_over_50 <- df$age > 50 & !df$smoking
df$lung_cancer[smoker_over_50]    <- sample(c(TRUE, FALSE), sum(smoker_over_50), prob = c(0.05, 0.95), replace = TRUE)
df$lung_cancer[nonsmoker_over_50] <- sample(c(TRUE, FALSE), sum(nonsmoker_over_50), prob = c(0.001, 0.999), replace = TRUE)

case_when() question
To define a default value with case_when(), add a TRUE condition as the last case, as in this example:

case_when(
  df$age > 50 & df$smoking == TRUE ~ "Group1",
  df$age > 50 & df$smoking == FALSE ~ "Group2",
  TRUE ~ "Everyone Else"
)

See ?case_when for more examples
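For the mutate() version the question asks about, here is one possible sketch, assuming dplyr is installed. It is not the only way to do it: case_when() is used only to pick a per-row probability, and a vectorized runif() draw replaces the per-row sample() call. The helper column p and the seed are my own additions for illustration.

```r
library(dplyr)

set.seed(42)  # arbitrary seed, only for reproducibility

# Same data as in the question
df <- tibble(
  id = 1:1000,
  smoking = sample(c(TRUE, FALSE), length(id), prob = c(0.2, 0.8), replace = TRUE),
  age = rnorm(length(id), mean = 60, sd = 10)
)

df <- df %>%
  mutate(
    # p is a hypothetical helper column: the probability of lung cancer per row
    p = case_when(
      age > 50 & smoking  ~ 0.05,
      age > 50 & !smoking ~ 0.001,
      TRUE                ~ 0      # age <= 50: never TRUE
    ),
    # one independent uniform draw per row; TRUE with probability p
    lung_cancer = runif(n()) < p
  )
```

Because runif() returns values strictly between 0 and 1, rows with p equal to 0 always come out FALSE. Drop the helper column with select(-p) if it is not needed.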

Upvotes: 1
