Nashe
Nashe

Reputation: 55

How to get the test with the highest score if enrollment date and test date are the same?

I have a dataset and I am trying to test the efficiency of a pre-enrollment course. I have enrollment date, test date, subject and the results. The students are grouped into

  1. group 1 - 30 days before enrollment
  2. group 2 - 30 days after enrollment
  3. group 3 - 45 days before enrollment up to 35 days after enrollment.

Each registration ID should fall into one group 1 being first priority, group 2 being second priority and group 3 being last. However, a student may have multiple tests on the same day but we should capture the registration id with the highest test score. If student falls out of the -45 days to 30 days after enrollment they are should be written as not classified​

Below is the data:

data <- data.frame(
      student_id = c("a53e83bzz", "a53e83bzz", "a53e83bzz", "2034cccc", "2034cccc", "2034cccc", "2034cccc", "202353bbbb", "202353bbbb", "1980polkfbb", "1980polkfbb"),
 registration_id = c("a53-ffe9", "a53-ffe9", "a53-ffe9", "203-ffde", "203-ffde", "203-ffde", "203-ffde", "202-ffcc", "202-ffcc", "198-ffb", "198-ffb"),
 subject = c("maths", "maths", "maths", "maths", "maths", "maths", "maths", "english", "english", "english", "english"),
enrollment_date = as.Date(c("2021-02-28", "2021-02-28", "2021-02-28", "2019-03-25", "2019-03-25", "2019-03-25", "2019-03-25", "2021-05-22", "2021-05-22", "2019-07-04", "2019-07-04"), format="%Y-%m-%d"),
test_score_category = c(0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 2),
test_date = as.Date(c("2021-02-27", "2021-02-27", "2022-07-08", "2019-02-18", "2019-03-11", "2020-04-07", "2020-04-07", "2021-06-17", "2021-06-07", "2019-03-14", "2019-03-28"), format="%Y-%m-%d"),
difference = c(-1, -1, 495, -35, -14, 379, 379, 26, 16, -112, -98)
 )

This is what I have tried in R, but I am not getting the exact results I want.

result <- data %>%
group_by(student_id, registration_id) %>%
arrange(student_id, group) %>%  # Prioritize by group (1 > 2 > 3)
slice_max(order_by = test_score_category) %>%  
ungroup()

Below is the result I am expecting

df <- data.frame(
  student_id = c("a53e83bzz", "2034cccc", "202353bbbb", "1980polkfbb"),
  registration_id = c("a53-ffe9", "203-ffde", "202-ffcc", "198-ffb"),
  subject = c("maths", "maths", "english", "english"),
  enrollment_date = as.Date(c("2021-02-28", "2019-03-25", "2021-05-22", "2019-07-04"), format="%Y-%m-%d"),
  test_score_category = c(1, 0, 0, NA),  # Use NA for not_classified
  test_date = as.Date(c("2021-02-27", "2019-03-11", "2021-06-07", NA), format="%Y-%m-%d"),  # Use NA for not_classified
  difference = c(-1, -14, 16, NA)  # Use NA for not_classified
  )

Upvotes: 3

Views: 94

Answers (2)

Jan
Jan

Reputation: 9263

library(dplyr)

data |> 
  mutate(group = case_when(between(difference, -30, 0) ~ 1,
                           between(difference, 0, 45) ~ 2,
                           between(difference, -45, 30) ~ 3)) |>
  slice_min(data.frame(group, difference, -test_score_category), by = student_id) |> 
  mutate(across(c(starts_with("test"), "difference"), ~ replace(.x, is.na(group), NA))) |>
  select(-group)

#   student_id registration_id subject enrollment_date test_score_category
#1   a53e83bzz        a53-ffe9   maths      2021-02-28                   1
#2    2034cccc        203-ffde   maths      2019-03-25                   0
#3  202353bbbb        202-ffcc english      2021-05-22                   0
#4 1980polkfbb         198-ffb english      2019-07-04                  NA
#   test_date difference
#1 2021-02-27         -1
#2 2019-03-11        -14
#3 2021-06-07         16
#4       <NA>         NA

Data:

> dput(data)
structure(list(student_id = c("a53e83bzz", "a53e83bzz", "a53e83bzz", 
"2034cccc", "2034cccc", "2034cccc", "2034cccc", "202353bbbb", 
"202353bbbb", "1980polkfbb", "1980polkfbb"), registration_id = c("a53-ffe9", 
"a53-ffe9", "a53-ffe9", "203-ffde", "203-ffde", "203-ffde", "203-ffde", 
"202-ffcc", "202-ffcc", "198-ffb", "198-ffb"), subject = c("maths", 
"maths", "maths", "maths", "maths", "maths", "maths", "english", 
"english", "english", "english"), enrollment_date = structure(c(18686, 
18686, 18686, 17980, 17980, 17980, 17980, 18769, 18769, 18081, 
18081), class = "Date"), test_score_category = c(0, 1, 0, 0, 
0, 0, 1, 1, 0, 1, 2), test_date = structure(c(18685, 18685, 19181, 
17945, 17966, 18359, 18359, 18795, 18785, 17969, 17983), class = "Date"), 
    difference = c(-1, -1, 495, -35, -14, 379, 379, 26, 16, -112, 
    -98)), class = "data.frame", row.names = c(NA, -11L))

Upvotes: 0

Kenneth K
Kenneth K

Reputation: 1

Base request

For the basic task of conditional grouping by the date difference, and keeping a single group assignation per student_id, the following should work:

library(dplyr)

data |> 
  mutate(
    group = case_when(
      -30 <= difference & difference <= 0  ~ "1",
      0    < difference & difference <= 30 ~ "2",
      -45 <= difference & difference <= 35 ~ "3",
      .default = NA
    ) |> 
      factor(level = c("1", "2", "3"))
  ) |> 
  arrange(student_id,group) |> 
  slice_head(n = 1,by = student_id)

The trick used here is to rely on factors being ordered by their level to ensure they are sorted as required by arrange(), and keep one line per student_id.

Detailed matching

To replicate fully what you shared as df:

data |> 
  mutate(
    group = case_when(
      -30 <= difference & difference <= 0  ~ "1",
      0    < difference & difference <= 30 ~ "2",
      -45 <= difference & difference <= 35 ~ "3",
      .default = NA
    ) |> factor(level = c("1", "2", "3"))
  ) |> 
  arrange(desc(student_id),group,difference,desc(test_score_category)) |> 
  slice_head(n = 1,by = student_id) |> 
  mutate(
    test_score_category = if_else(is.na(group), NA, test_score_category),
    test_date = if_else(is.na(group), NA, test_date),
    difference = if_else(is.na(group), NA, difference)
  ) |> 
  select(-group)

Upvotes: 0

Related Questions