lecb

Reputation: 409

Case_when issue using sum to classify data - R/dplyr solution

I'm probably doing something stupid here but would appreciate some help. I'm trying to classify some data that has been incorrectly filled in.

df <- data.frame(ID = c("A", "A", "A","A", "A", "B", "B", "B", "B", "B"),
                 headache_y_n = c("Yes", "Yes", "Yes", "No", "Yes", "No", "No", "No", "Yes", "No"),
                 headache_days =c("2", "2", "2", "2", "2", "1", "1", "1", "1", "1"))

I want to say: if headache_y_n is "Yes" at least 3 times per ID, then that ID meets the criteria of "prolonged"; otherwise it should be "short".

Therefore, I want the following output:

output <- data.frame(ID = c("A", "A", "A","A", "A", "B", "B", "B", "B", "B"),
                 headache_y_n = c("Yes", "Yes", "Yes", "No", "Yes", "No", "No", "No", "Yes", "No"),
                 headache_days =c("2", "2", "2", "2", "2", "1", "1", "1", "1", "1"),
                 criteria =c("prolonged", "prolonged", "prolonged", "prolonged", "prolonged", "short", "short", "short", "short", "short"))

My code is as follows:

library(dplyr)
df %>% group_by(ID) %>% mutate(criteria=case_when(
    sum(any(headache_y_n=="Yes") >= 3) ~ "prolonged",
    TRUE ~ "short"
))

Unfortunately it doesn't work and I get the following error:

Error: Problem with `mutate()` input `criteria`.
x LHS of case 1 (`sum(any(headache_y_n == "Yes") >= 3)`) must be a logical vector, not an integer vector.
ℹ Input `criteria` is `case_when(...)`.
ℹ The error occurred in group 1: ID = "A".

I'm not smart enough to figure out where I'm going wrong, hence why I'm asking kindly for your help!

Thanks!

Upvotes: 2

Views: 827

Answers (1)

akrun

Reputation: 887511

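The any and sum should be switched. After grouping by 'ID', we count the number of 'Yes' values, i.e. take the sum of the logical expression (headache_y_n == 'Yes'), and then compare that sum with 3 (sum >= 3). In the original code, any() collapses each group to a single TRUE before the comparison, so the >= 3 test is always FALSE and the outer sum() returns an integer, which is why case_when reports that the LHS must be a logical vector. The comparison can be wrapped in any, but that is probably not needed here as the sum is only a single value per group.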
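To check this by hand, the two orderings can be evaluated piece by piece on one group. This is a minimal base R sketch; the vector a is just group A's headache_y_n values copied from the question, and the other variable names are only for illustration:

```r
# Group A's values, copied from the question's data
a <- c("Yes", "Yes", "Yes", "No", "Yes")

# Original order: any() first collapses the group to a single TRUE
step1 <- any(a == "Yes")      # TRUE
step2 <- step1 >= 3           # TRUE is coerced to 1, and 1 >= 3 is FALSE
original <- sum(step2)        # sum() of a logical returns an integer (0L)
class(original)               # "integer", which case_when() rejects as a LHS

# Switched order: count the "Yes" values first, then compare
n_yes <- sum(a == "Yes")      # 4
fixed <- n_yes >= 3           # a length-one logical
class(fixed)                  # "logical"
```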

library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(criteria = case_when(
    any(sum(headache_y_n == "Yes") >= 3) ~ "prolonged",
    TRUE ~ "short"
  ))

Even if we remove the any, it returns the same result:

df %>%
  group_by(ID) %>%
  mutate(criteria = case_when(
    sum(headache_y_n == "Yes") >= 3 ~ "prolonged",
    TRUE ~ "short"
  ))
# A tibble: 10 x 4
# Groups:   ID [2]
#   ID    headache_y_n headache_days criteria 
#   <chr> <chr>        <chr>         <chr>    
# 1 A     Yes          2             prolonged
# 2 A     Yes          2             prolonged
# 3 A     Yes          2             prolonged
# 4 A     No           2             prolonged
# 5 A     Yes          2             prolonged
# 6 B     No           1             short    
# 7 B     No           1             short    
# 8 B     No           1             short    
# 9 B     Yes          1             short    
#10 B     No           1             short
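
Since there are only two outcomes, the same classification could probably also be written with if_else() instead of case_when(). This is a sketch of an alternative rather than part of the answer above; it rebuilds df from the question (using rep() for brevity) and compares the result with the labels in the question's expected output. The names result and expected are introduced here only for illustration:

```r
library(dplyr)

# Same data as in the question
df <- data.frame(
  ID = rep(c("A", "B"), each = 5),
  headache_y_n = c("Yes", "Yes", "Yes", "No", "Yes", "No", "No", "No", "Yes", "No"),
  headache_days = rep(c("2", "1"), each = 5)
)

result <- df %>%
  group_by(ID) %>%
  # sum() counts the "Yes" rows in each group, so the condition is a single
  # logical per group; mutate() recycles the label to every row of the group
  mutate(criteria = if_else(sum(headache_y_n == "Yes") >= 3, "prolonged", "short")) %>%
  ungroup()

# Labels from the question's expected output: A is "prolonged", B is "short"
expected <- rep(c("prolonged", "short"), each = 5)
identical(result$criteria, expected)
```

The ungroup() at the end is optional; it only drops the grouping so that later steps are not applied per ID by accident.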

Upvotes: 2
