Calculate time difference between observations for some rows only

Question

I've got a dataset that looks like this:

 id      date         outcome
 ny21    2021-03-01   0
 ny21    2021-02-01   1
 ny21    2021-01-01   0
 ch67    2021-08-09   0

I can calculate the time difference between rows grouped by id in the following way:

 unlist(tapply(df$date, INDEX = df$id, FUN = function(x) c(0, `units<-`(diff(x), "days"))))

However, I also want to calculate the time difference between rows grouped by id, where outcome == 1. If the current row has outcome == 1, and no such previous row exists, I want to put 0. So, something like this:

 id      date         outcome  days_since_outcome_1
 ny21    2021-03-01   0        28
 ny21    2021-02-01   1        0
 ny21    2021-01-01   0        0
 ch67    2021-08-09   0        0

How can I do this?

EDIT:

In some situations, there are more than one outcome == 1 per id. In this case, I would like the code to calculate the difference since the latest match (not the first match). Something like this:

 id      date         outcome  days_since_outcome_1
 ny21    2021-05-01   1        30  # this is days since 2021-04-01
 ny21    2021-04-01   1        59  # this is days since 2021-02-01
 ny21    2021-03-01   0        28  # this is days since 2021-02-01 (first match)
 ny21    2021-02-01   1        0
 ny21    2021-01-01   0        0
 ch67    2021-08-09   0        0

Ronak Shah · Accepted Answer

Using match get the corresponding date where outcome = 1 for a id. Change the values which are before the outcome date to 0 along with those values which do not have any value with outcome = 1.

library(dplyr)

df %>%
  mutate(date = as.Date(date)) %>%
  group_by(id) %>%
  mutate(days_since_outcome_1 = as.integer(date - date[match(1, outcome)]), 
         days_since_outcome_1 = replace(days_since_outcome_1, 
               days_since_outcome_1 < 0 | is.na(days_since_outcome_1), 0)) %>%
  ungroup


#   id    date       outcome days_since_outcome_1
#                          
#1 ny21  2021-03-01       0                   28
#2 ny21  2021-02-01       1                    0
#3 ny21  2021-01-01       0                    0
#4 ch67  2021-08-09       0                    0

For larger datasets, you can try data.table -

setDT(df)
df
df[, days_since_outcome_1 := as.integer(date - date[match(1, outcome)]), id]
df[, days_since_outcome_1 := replace(days_since_outcome_1, 
            days_since_outcome_1 < 0 | is.na(days_since_outcome_1), 0)]
df

Calculate time difference between observations for some rows only

Answers (1)

Related Questions