Kaya
Kaya

Reputation: 115

R: Expand rows according to start and end date and calculate hours between days

My question extends this one: Generate rows between two dates into a data frame in R

I have a dataset on admissions, discharges and lengths of stay (Stay_in_days) of patients from a hospital. It looks like this:

ID Admission           Discharge             Stay_in_days 
1    2020-08-20 15:25:03 2020-08-21 21:09:34 1.239
2    2020-10-04 21:53:43 2020-10-09 11:02:57 4.548
... 

Dates are in POSIXct format so far.

I aim for this:

ID   Date                 Stay_in_days 
1    2020-08-20 15:25:03  0.357 
1    2020-08-21 21:09:49  1.239
2    2020-10-04 21:53:43  0.087
2    2020-10-05 00:00:00  1.087
2    2020-10-06 00:00:00  2.087
2    2020-10-07 00:00:00  3.087
2    2020-10-08 00:00:00  4.087
2    2020-10-09 11:02:57  4.548
...

What I have done so far:

M <- Map(seq, patients$Admission, patients$Discharge, by = "day")
patients2 <- data.frame(
  ID = rep.int(patients$ID, vapply(M, length, 1L)), 
  Date = do.call(c, M)
) 

patients <- patients %>%
mutate(
 Date2=as.Date(Date, format = "%Y-%m-%d"),
 Dat2=Date2+1,
 Diff=difftime(Date2, Date, units = "days")
)


but this gives me:

ID   Date                 Date2      Diff
1    2020-08-20 17:25:03  2020-08-21 0.375
1    2020-08-21 17:25:03  2020-08-22 0.357
2    2020-10-04 23:53:43  2020-10-05 0.087
2    2020-10-05 23:53:43  2020-10-06 0.087
2    2020-10-06 23:53:43  2020-10-07 0.087
2    2020-10-07 23:53:43  2020-10-08 0.087
2    2020-10-08 23:53:43  2020-10-09 0.087
...

Strangely enough, it adds two hours to the Admission date but calculates the correct length of stay. Can someone explain?

Here is some data:

structure(list(ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 
13, 14, 15, 16, 17, 18, 19, 20), Admission = structure(c(1597937103.872, 
1598717768.704, 1599060521.984, 1599758087.168, 1599815496.704, 
1600702198.784, 1600719631.36, 1601065923.584, 1601119400.96, 
1601215476.736, 1601236710.4, 1601416934.4, 1601499640.832, 1601545647.104, 
1601587328, 1601644868.608, 1601741206.528, 1601848423.424, 1601901245.44, 
1601913828.352), class = c("POSIXct", "POSIXt"), tzone = "UTC"), 
    Discharge = structure(c(1598044189.696, 1598897337.344, 1599144670.208, 
    1599845118.976, 1599842366.464, 1602733683.712, 1603372135.424, 
    1601125168.128, 1601314173.952, 1605193905.152, 1602190259.2, 
    1601560720.384, 1601737143.296, 1602705634.304, 1602410460.16, 
    1602698425.344, 1601770566.656, 1602241377.28, 1602780476.416, 
    1602612048.896), class = c("POSIXct", "POSIXt"), tzone = "UTC"), 
    Stay_in_days = c(1.239, 2.078, 0.974, 1.007, 0.311, 23.513, 
    30.7, 0.686, 2.254, 46.047, 11.036, 1.664, 2.749, 13.426, 
    9.527, 12.194, 0.34, 4.548, 10.176, 8.081)), row.names = c(NA, 
-20L), class = c("tbl_df", "tbl", "data.frame"))

Thanks in advance for your help!

Upvotes: 2

Views: 492

Answers (2)

AnilGoyal
AnilGoyal

Reputation: 26218

Though it is a bit crude but it'll work

library(tidyverse)
library(lubridate)

df %>% 
  pivot_longer(cols = -c(ID, Stay_in_days), names_to = "Event", values_to = "DATE") %>%
  group_by(ID) %>%
  mutate(dummy = case_when(Event == "Admission" ~ 0,
                           Event == "Discharge" ~ max(floor(Stay_in_days),1))) %>%
  complete(dummy = seq(min(dummy), max(dummy), 1)) %>%
  mutate(Event = ifelse(is.na(Event), "Dummy", Event),
         DATE = if_else(is.na(DATE), first(DATE)+dummy*24*60*60, DATE),
         Stay_in_days = case_when(Event == "Admission" ~ as.numeric(difftime(ceiling_date(DATE, "day"), DATE, units = "days")),
                                   Event == "Discharge" ~ Stay_in_days,
                                   TRUE ~ dummy + as.numeric(difftime(ceiling_date(first(DATE), "day"), first(DATE), units = "days")))) %>%
  select(ID, DATE, Stay_in_days)

# A tibble: 199 x 3
# Groups:   ID [20]
      ID DATE                Stay_in_days
   <dbl> <dttm>                     <dbl>
 1     1 2020-08-20 15:25:03        0.358
 2     1 2020-08-21 21:09:49        1.24 
 3     2 2020-08-29 16:16:08        0.322
 4     2 2020-08-30 16:16:08        1.32 
 5     2 2020-08-31 18:08:57        2.08 
 6     3 2020-09-02 15:28:41        0.355
 7     3 2020-09-03 14:51:10        0.974
 8     4 2020-09-10 17:14:47        0.281
 9     4 2020-09-11 17:25:18        1.01 
10     5 2020-09-11 09:11:36        0.617
# ... with 189 more rows

Explanation of logic For the first date in every ID, the stay_in_days gives the duration from admission date-time to following 24 Hrs. For intermediate dates, it just adds 1 to previous value. For discharge_date it retains the stay value calculated prior to pivoting. Hope this was you after.

Explanation of code After pivoting longer, I used a dummy column to create intermediate date-time objects. After that I just mutate the columns for generating output as described above.

Upvotes: 1

TarJae
TarJae

Reputation: 78917

You can achieve this with pivot_longer from tidyr. Edit: with comments:

df1 <- df %>% 
  select(ID = ID, date1 = Admission, date2 = Discharge, Stay_in_days) %>% # prepare for pivoting
  pivot_longer(
    cols = starts_with("date"),
    names_to = "Date1",
    values_to = "Date",
  ) %>% # pivot to longformat
  select(-Date1) %>% # remove temporary Date1
  relocate(Stay_in_days, .after = Date) %>% # change column order
  group_by(ID) %>%
  mutate(idgroup = rep(row_number(), each=1:2, length.out = n())) %>% # id for admission = 1 and for discharge id = 2
  mutate(Stay_in_days = replace(Stay_in_days, row_number() == 1, 0)) %>%  # set Admission to zero
  ungroup() 

enter image description here

Upvotes: 1

Related Questions