J.Sabree
J.Sabree

Reputation: 2536

case_when conditional using purrr that reads over column sequence

I'm trying to classify participants' current status in a course. This is an extension of this post: purrr pmap to read max column name by column name number. My dataset looks like this:

library(dplyr)
problem <- tibble(name = c("Angela", "Claire", "Justin", "Bob", "Joseph", "Gil"),
                   status_1 = c("Registered", "Withdrawn", "Completed", "Registered", "Registered", "Registered"),
                   status_2 = c("Withdrawn", "Withdrawn", "Registered", "NA", "Withdrawn", "Cancelled"),
                   status_3 = c("NA", "Registered", "Withdrawn", "NA", "Registered", "NA"),
                   status_4 = c("Withdrawn", "Registered", "Withdrawn", "NA", "Registered", "NA"))

I want to classify people's current status. If someone has completed the course at any status, their status is "Completed." However, what's tricky is their registered status. Someone is "Registered" IF their final status is registered OR if the later status is "NA". They are NOT registered if a status after their registration is withdrawn or cancelled. So, the final dataset should look this:

library(dplyr)
solution <- tibble(name = c("Angela", "Claire", "Justin", "Bob", "Joseph", "Gil"),
                   status_1 = c("Registered", "Withdrawn", "Completed", "Registered", "Registered", "Registered"),
                   status_2 = c("Withdrawn", "Withdrawn", "Registered", "NA", "Withdrawn", "Cancelled"),
                   status_3 = c("NA", "Registered", "Withdrawn", "NA", "Registered", "NA"),
                   status_4 = c("Withdrawn", "Registered", "Withdrawn", "NA", "Registered", "NA"),
                   current = c("Not Taken", "Registered", "Completed", "Registered", "Registered", "Not Taken"))

Angela is not taken because she withdrew after her registration. Claire is registered because, despite her past withdrawals, she more recently registered. Justin is completed because he completed the course at any status. Bob is registered because he has not withdrawn or had the course cancelled. Similar to Claire, Joseph has registered more recently than his withdrawal, so he is registered. Finally, Gil is "Not Taken" because his course was canceled, and he doesn't have a more recent registration.

Here's my code:

library(tidyverse)
solution %>% 
  mutate(
    test =
      pmap_chr(select(., contains("status")), ~
        case_when(
          any(str_detect(c(...), "(?i)Completed")) ~ "Completed",
          any(str_detect(c(...), "(?i)Exempt")) | any(str_detect(c(...), "(?i)Incomplete")) ~ "Exclude",
          length(c(...) == "Registered") > length(c(...) == "Withdrawn") | length(c(...) == "Registered") > length(c(...) == "Cancelled")  ~ "Registered",
          any(str_detect(c(...), "(?i)No Show")) | any(str_detect(c(...), "(?i)Denied")) | any(str_detect(c(...), "(?i)Cancelled")) | any(str_detect(c(...), "(?i)Waitlist Expired")) || any(str_detect(c(...), "(?i)Withdrawn")) ~ "Not Taken",
          TRUE ~ "NA"
        )
      )
  )

I can't figure out how to crack the code with the registration portion. Ideally, I'd like to retain as much of this code as possible because my true dataset has many columns of status.

Thank you!

Upvotes: 0

Views: 673

Answers (1)

Clemens Hug
Clemens Hug

Reputation: 497

I think this problem is easier to solve if you first rearrange your data a bit according to "tidy data" principles, where each time a student's status changed is recorded in a separate row. The rearranged data is in problem_wrangled.

Then, the current status can usually be determined by looking only at the most recent status, with the exception of the "Completed" status, for which we check all past statuses.

library(tidyverse)

problem <- tibble(name = c("Angela", "Claire", "Justin", "Bob", "Joseph", "Gil"),
                  status_1 = c("Registered", "Withdrawn", "Completed", "Registered", "Registered", "Registered"),
                  status_2 = c("Withdrawn", "Withdrawn", "Registered", "NA", "Withdrawn", "Cancelled"),
                  status_3 = c("NA", "Registered", "Withdrawn", "NA", "Registered", "NA"),
                  status_4 = c("Withdrawn", "Registered", "Withdrawn", "NA", "Registered", "NA"))

status_wrangled <- problem %>%
  gather(key = "time", value = "status", starts_with("status")) %>%
  mutate(time = as.integer(str_split_fixed(time, "_", 2)[, 2])) %>%
  arrange(name, time) %>%
  filter(status != "NA")

head(status_wrangled)
#> # A tibble: 6 x 3
#>   name   time  status    
#>   <chr>  <chr> <chr>     
#> 1 Angela 1     Registered
#> 2 Angela 2     Withdrawn 
#> 3 Angela 4     Withdrawn 
#> 4 Bob    1     Registered
#> 5 Claire 1     Withdrawn 
#> 6 Claire 2     Withdrawn

status_current <- status_wrangled %>%
  group_by(name) %>%
  summarize(
    current = case_when(
      # Has student completed at any time?
      "Completed" %in% status ~ "Completed",
      # Examine last recorded status
      tail(status, 1) %in% c("Exempt", "Incomplete") ~ "Exclude",
      tail(status, 1) %in% c("Withdrawn", "Cancelled", "No Show", "Denied", "Waitlist Expired") ~ "Not Taken",
      tail(status, 1) == "Registered" ~ "Registered",
      TRUE ~ "Unknown"
    )
  )

print(status_current, n = Inf)
#> # A tibble: 6 x 2
#>   name   current   
#>   <chr>  <chr>     
#> 1 Angela Not Taken 
#> 2 Bob    Registered
#> 3 Claire Registered
#> 4 Gil    Not Taken 
#> 5 Joseph Registered
#> 6 Justin Completed

Created on 2019-06-17 by the reprex package (v0.3.0)

EDIT: Regarding your comment about doing approximate matches: I modified the example to allow approximate matches up to a certain edit distance. You probably want to tune this, but allowing up to three edits or so seems reasonable. But be careful, the difference between "Incomplete" and "Completed" is just four edits.

library(tidyverse)

problem <- tibble(name = c("Angela", "Claire", "Justin", "Bob", "Joseph", "Gil"),
                  status_1 = c("Registered", "Withdrawn", "Completed", "Registered", "Registered", "Registered"),
                  status_2 = c("Withdrawn", "Withdrawn", "Registered", "NA", "Withdrawn", "Cancelled"),
                  status_3 = c("NA", "Registered", "Withdrawn", "NA", "Registered", "NA"),
                  status_4 = c("Withdrawnn", "Registered", "Withdrawn", "NA", "Registered", "NA"))

status_wrangled <- problem %>%
  gather(key = "time", value = "status", starts_with("status")) %>%
  mutate(time = as.integer(str_split_fixed(time, "_", 2)[, 2])) %>%
  arrange(name, time) %>%
  filter(status != "NA")

# Find if input vector matches to *any* given pattern below the specified edit distance
any_fuzzy_match <- function(x, patterns, max.distance = 3) {
  matches <- map(paste0("^", patterns, "$"), agrepl, x = x, max.distance = max.distance, fixed = FALSE)
  reduce(matches, `|`)
}

status_current <- status_wrangled %>%
  group_by(name) %>%
  summarize(
    current = case_when(
      # Has student completed at any time?
      any(any_fuzzy_match(status, "Completed")) ~ "Completed",
      # Examine last recorded status
      any_fuzzy_match(tail(status, 1), c("Exempt", "Incomplete")) ~ "Exclude",
      any_fuzzy_match(tail(status, 1), c("Withdrawn", "Cancelled", "No Show", "Denied", "Waitlist Expired")) ~ "Not Taken",
      any_fuzzy_match(tail(status, 1), "Registered") ~ "Registered",
      TRUE ~ "Unknown"
    )
  )

print(status_current, n = Inf)
#> # A tibble: 6 x 2
#>   name   current   
#>   <chr>  <chr>     
#> 1 Angela Not Taken 
#> 2 Bob    Registered
#> 3 Claire Registered
#> 4 Gil    Not Taken 
#> 5 Joseph Registered
#> 6 Justin Completed

Created on 2019-06-18 by the reprex package (v0.3.0)

Upvotes: 1

Related Questions