Reputation: 2536
I'm trying to classify participants' current status in a course. This is an extension of this post: purrr pmap to read max column name by column name number. My dataset looks like this:
library(dplyr)
problem <- tibble(name = c("Angela", "Claire", "Justin", "Bob", "Joseph", "Gil"),
status_1 = c("Registered", "Withdrawn", "Completed", "Registered", "Registered", "Registered"),
status_2 = c("Withdrawn", "Withdrawn", "Registered", "NA", "Withdrawn", "Cancelled"),
status_3 = c("NA", "Registered", "Withdrawn", "NA", "Registered", "NA"),
status_4 = c("Withdrawn", "Registered", "Withdrawn", "NA", "Registered", "NA"))
I want to classify people's current status. If someone has completed the course at any status, their status is "Completed." However, what's tricky is their registered status. Someone is "Registered" IF their final status is registered OR if the later status is "NA". They are NOT registered if a status after their registration is withdrawn or cancelled. So, the final dataset should look this:
library(dplyr)
solution <- tibble(name = c("Angela", "Claire", "Justin", "Bob", "Joseph", "Gil"),
status_1 = c("Registered", "Withdrawn", "Completed", "Registered", "Registered", "Registered"),
status_2 = c("Withdrawn", "Withdrawn", "Registered", "NA", "Withdrawn", "Cancelled"),
status_3 = c("NA", "Registered", "Withdrawn", "NA", "Registered", "NA"),
status_4 = c("Withdrawn", "Registered", "Withdrawn", "NA", "Registered", "NA"),
current = c("Not Taken", "Registered", "Completed", "Registered", "Registered", "Not Taken"))
Angela is not taken because she withdrew after her registration. Claire is registered because, despite her past withdrawals, she more recently registered. Justin is completed because he completed the course at any status. Bob is registered because he has not withdrawn or had the course cancelled. Similar to Claire, Joseph has registered more recently than his withdrawal, so he is registered. Finally, Gil is "Not Taken" because his course was canceled, and he doesn't have a more recent registration.
Here's my code:
library(tidyverse)
solution %>%
mutate(
test =
pmap_chr(select(., contains("status")), ~
case_when(
any(str_detect(c(...), "(?i)Completed")) ~ "Completed",
any(str_detect(c(...), "(?i)Exempt")) | any(str_detect(c(...), "(?i)Incomplete")) ~ "Exclude",
length(c(...) == "Registered") > length(c(...) == "Withdrawn") | length(c(...) == "Registered") > length(c(...) == "Cancelled") ~ "Registered",
any(str_detect(c(...), "(?i)No Show")) | any(str_detect(c(...), "(?i)Denied")) | any(str_detect(c(...), "(?i)Cancelled")) | any(str_detect(c(...), "(?i)Waitlist Expired")) || any(str_detect(c(...), "(?i)Withdrawn")) ~ "Not Taken",
TRUE ~ "NA"
)
)
)
I can't figure out how to crack the code with the registration portion. Ideally, I'd like to retain as much of this code as possible because my true dataset has many columns of status.
Thank you!
Upvotes: 0
Views: 673
Reputation: 497
I think this problem is easier to solve if you first rearrange your data a bit according to "tidy data" principles, where each time a student's status changed is recorded in a separate row. The rearranged data is in problem_wrangled
.
Then, the current status can usually be determined by looking only at the most recent status, with the exception of the "Completed" status, for which we check all past statuses.
library(tidyverse)
problem <- tibble(name = c("Angela", "Claire", "Justin", "Bob", "Joseph", "Gil"),
status_1 = c("Registered", "Withdrawn", "Completed", "Registered", "Registered", "Registered"),
status_2 = c("Withdrawn", "Withdrawn", "Registered", "NA", "Withdrawn", "Cancelled"),
status_3 = c("NA", "Registered", "Withdrawn", "NA", "Registered", "NA"),
status_4 = c("Withdrawn", "Registered", "Withdrawn", "NA", "Registered", "NA"))
status_wrangled <- problem %>%
gather(key = "time", value = "status", starts_with("status")) %>%
mutate(time = as.integer(str_split_fixed(time, "_", 2)[, 2])) %>%
arrange(name, time) %>%
filter(status != "NA")
head(status_wrangled)
#> # A tibble: 6 x 3
#> name time status
#> <chr> <chr> <chr>
#> 1 Angela 1 Registered
#> 2 Angela 2 Withdrawn
#> 3 Angela 4 Withdrawn
#> 4 Bob 1 Registered
#> 5 Claire 1 Withdrawn
#> 6 Claire 2 Withdrawn
status_current <- status_wrangled %>%
group_by(name) %>%
summarize(
current = case_when(
# Has student completed at any time?
"Completed" %in% status ~ "Completed",
# Examine last recorded status
tail(status, 1) %in% c("Exempt", "Incomplete") ~ "Exclude",
tail(status, 1) %in% c("Withdrawn", "Cancelled", "No Show", "Denied", "Waitlist Expired") ~ "Not Taken",
tail(status, 1) == "Registered" ~ "Registered",
TRUE ~ "Unknown"
)
)
print(status_current, n = Inf)
#> # A tibble: 6 x 2
#> name current
#> <chr> <chr>
#> 1 Angela Not Taken
#> 2 Bob Registered
#> 3 Claire Registered
#> 4 Gil Not Taken
#> 5 Joseph Registered
#> 6 Justin Completed
Created on 2019-06-17 by the reprex package (v0.3.0)
EDIT: Regarding your comment about doing approximate matches: I modified the example to allow approximate matches up to a certain edit distance. You probably want to tune this, but allowing up to three edits or so seems reasonable. But be careful, the difference between "Incomplete" and "Completed" is just four edits.
library(tidyverse)
problem <- tibble(name = c("Angela", "Claire", "Justin", "Bob", "Joseph", "Gil"),
status_1 = c("Registered", "Withdrawn", "Completed", "Registered", "Registered", "Registered"),
status_2 = c("Withdrawn", "Withdrawn", "Registered", "NA", "Withdrawn", "Cancelled"),
status_3 = c("NA", "Registered", "Withdrawn", "NA", "Registered", "NA"),
status_4 = c("Withdrawnn", "Registered", "Withdrawn", "NA", "Registered", "NA"))
status_wrangled <- problem %>%
gather(key = "time", value = "status", starts_with("status")) %>%
mutate(time = as.integer(str_split_fixed(time, "_", 2)[, 2])) %>%
arrange(name, time) %>%
filter(status != "NA")
# Find if input vector matches to *any* given pattern below the specified edit distance
any_fuzzy_match <- function(x, patterns, max.distance = 3) {
matches <- map(paste0("^", patterns, "$"), agrepl, x = x, max.distance = max.distance, fixed = FALSE)
reduce(matches, `|`)
}
status_current <- status_wrangled %>%
group_by(name) %>%
summarize(
current = case_when(
# Has student completed at any time?
any(any_fuzzy_match(status, "Completed")) ~ "Completed",
# Examine last recorded status
any_fuzzy_match(tail(status, 1), c("Exempt", "Incomplete")) ~ "Exclude",
any_fuzzy_match(tail(status, 1), c("Withdrawn", "Cancelled", "No Show", "Denied", "Waitlist Expired")) ~ "Not Taken",
any_fuzzy_match(tail(status, 1), "Registered") ~ "Registered",
TRUE ~ "Unknown"
)
)
print(status_current, n = Inf)
#> # A tibble: 6 x 2
#> name current
#> <chr> <chr>
#> 1 Angela Not Taken
#> 2 Bob Registered
#> 3 Claire Registered
#> 4 Gil Not Taken
#> 5 Joseph Registered
#> 6 Justin Completed
Created on 2019-06-18 by the reprex package (v0.3.0)
Upvotes: 1