Mahm00d27
Mahm00d27

Reputation: 39

obtain maximum date from multiple leading dates

Sample data:

data <- data.frame(
  year = c(2018, 2018, 2018, 2018, 2018, 2018, 2018, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020),
  patient_id = c(101, 102, 103, 104, 102, 102, 106, 105, 105, 107, 108, 109, 105, 105, 201, 202, 203, 204, 205, 205, 205, 209, 208),
  discharge_date = as.Date(c("1/1/2018", "1/5/2018", "1/8/2018", "2/5/2018", "2/10/2018", "2/11/2018", "3/1/2018", "1/2/2019", "1/10/2019", "3/1/2019", "3/5/2019", "3/25/2019", "5/5/2019", "5/6/2019", "1/1/2020", "2/1/2020", "2/10/2020", "3/3/2020", "4/1/2020", "4/2/2020", "4/3/2020", "6/17/2020", "8/8/2020"), format = "%m/%d/%Y"),
  contagious_admission = c(0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0)
)

I am trying to create the column latest_discharge_date, which takes in the discharge_date value following the following logic: If patient_id of current row is different from the leading patient_id, then it takes in the discharge_date from current row. If patient_id of current row is the same as the leading patient_id, AND in both rows contagious_admission == 0, then it takes in the discharge_date from current row. But when there is leading single or multiple consecutive rows with same patient_id, AND in the first row of such series, contagious_admission == 0, in the latter ones, contagious_admission == 1, the latest_discharge_date of the first row should take in the latest or the maximum discharge_date from the consecutive rows of contagious_admission == 1.

1st Attempt:

data |>
  mutate(
    latest_discharge_date = case_when(
      contagious_admission == 0 & lead(contagious_admission) == 0 ~ discharge_date,
      contagious_admission == 0 & lead(contagious_admission) == 1 ~ lead(discharge_date)
    , TRUE ~ NA)
  )

Everything worked fine, but if you look at patient_id = 205, latest_discharge_date of the index row (patient_id == 205 & contagious_admission == 0) is taking in "2020-04-02". But I need it to take the next leading date (belonging to the same patient_id and contagious_admission == 1 "group"), which is "2020-04-03".

2nd Attempt:

data |>
  mutate(
    latest_discharge_date = case_when(
      contagious_admission == 0 & lead(contagious_admission) == 0 ~ discharge_date,
      contagious_admission == 0 & lead(contagious_admission) == 1 ~ lead(pmax(lead(discharge_date)))
    , TRUE ~ NA)
  )

This one nails the ones with multiple leading rows of contagious_admission == 1, but overshoots the single ones.

Upvotes: 1

Views: 51

Answers (1)

Ben
Ben

Reputation: 30539

One approach is to assign a count for each admission for each patient_id (this allows for patients to have more than one admission easily). Each row where the contagious_admission is zero, it is considered a new admission.

Then, you can group based on both the patient_id and the count, and if contagious_admission is 0, then use the last discharge_date, otherwise mark as missing (as in the example).

library(tidyverse)

data |>
  group_by(patient_id) |>
  mutate(adm_count = cumsum(contagious_admission == 0)) |>
  group_by(patient_id, adm_count) |>
  mutate(latest_discharge_date = if_else(
    contagious_admission == 0,
    last(discharge_date),
    NA)) |>
  ungroup() |>
  select(-adm_count)

Output

    year patient_id discharge_date contagious_admission latest_discharge_date
   <dbl>      <dbl> <date>                        <dbl> <date>               
 1  2018        101 2018-01-01                        0 2018-01-01           
 2  2018        102 2018-01-05                        0 2018-01-05           
 3  2018        103 2018-01-08                        0 2018-01-08           
 4  2018        104 2018-02-05                        0 2018-02-05           
 5  2018        102 2018-02-10                        0 2018-02-11           
 6  2018        102 2018-02-11                        1 NA                   
 7  2018        106 2018-03-01                        0 2018-03-01           
 8  2019        105 2019-01-02                        0 2019-01-02           
 9  2019        105 2019-01-10                        0 2019-01-10           
10  2019        107 2019-03-01                        0 2019-03-01           
11  2019        108 2019-03-05                        0 2019-03-05           
12  2019        109 2019-03-25                        0 2019-03-25           
13  2019        105 2019-05-05                        0 2019-05-06           
14  2019        105 2019-05-06                        1 NA                   
15  2020        201 2020-01-01                        0 2020-01-01           
16  2020        202 2020-02-01                        0 2020-02-01           
17  2020        203 2020-02-10                        0 2020-02-10           
18  2020        204 2020-03-03                        0 2020-03-03           
19  2020        205 2020-04-01                        0 2020-04-03           
20  2020        205 2020-04-02                        1 NA                   
21  2020        205 2020-04-03                        1 NA                   
22  2020        209 2020-06-17                        0 2020-06-17           
23  2020        208 2020-08-08                        0 2020-08-08

Upvotes: 1

Related Questions