Rizwan
Rizwan

Reputation: 27

Filtering dates from different columns depending on conditions

An example of my dataset:

library('lubridate')
id<-c(1,1,2,2,2,3,3,3)
admidate<-c(dmy('01/01/2010', '01/01/2011', '01/02/2010', '01/02/2011', '01/01/2012', '01/01/2010', '01/03/2010', '01/03/2011'))
disdate<-c(dmy('15/01/2010', '15/01/2011', '15/02/2010', '15/02/2011', '15/02/2012', '15/01/2010', '15/03/2010', '15/03/2011'))
first_adm<-c(1,0,1,0,0,1,0,0)
last_adm<-c(0,1,0,0,1,0,0,1)
intervention<-c(1,0,0,1,0,0,0,0)
df<-data.frame(id, admidate, disdate, first_adm, last_adm, intervention)

For each id, I want to get a column of dates where:

  1. If intervention == 1, time difference between admidate for this row and last_adm==1 is returned.
  2. If intervention == 0, time difference between admidate when first_adm==1 for that id and last_adm==1 for that id is returned.

Each id has only one first_adm and only one last_adm.

The purpose is to get follow times to compare between the two groups. For one it starts when they had an 'intervention'. For others who did not have intervention, it starts at first contact.

The df finally would look something like this:

 id   admidate    disdate  first_adm last_adm   intervention fu_time
  1 2010-01-01 2010-01-15         1        0            1     -
  1 2011-01-01 2011-01-15         0        1            0     365
  2 2010-02-01 2010-02-15         1        0            0     -
  2 2011-01-01 2011-02-15         0        0            1     -
  2 2012-01-01 2012-02-15         0        1            0     365
  3 2010-01-01 2010-01-15         1        0            0     -
  3 2010-03-01 2010-03-15         0        0            0     -
  3 2011-03-01 2011-03-15         0        1            0     431

Upvotes: 1

Views: 89

Answers (1)

Uwe
Uwe

Reputation: 42544

This is what I would try with my preferred tool.

data.table version

library(data.table)
setDT(df)[, {
  fa <- first(admidate[first_adm == 1])
  la <- last(admidate[last_adm == 1]) 
  fi <- first(admidate[intervention == 1])
  ai <- any(intervention == 1)
  .(fu_time = la - if (ai) pmax(fa, fi) else fa)
}, by = id]

which simply returns the result of each id

   id  fu_time
1:  1 365 days
2:  2 334 days
3:  3 424 days

Note that we still use first() and last() here as we are aggregating by id although the OP has assured that each id has only one first_adm and only one last_adm.

However the OP wants to have the original df updated. With data.table, we can append a new column where each row contains the result for the respective group:

setDT(df)[, fu_time := {
  fa <- first(admidate[first_adm == 1])
  la <- last(admidate[last_adm == 1]) 
  fi <- first(admidate[intervention == 1])
  ai <- any(intervention == 1)
  la - if (ai) pmax(fa, fi) else fa
}, by = id]

Now, the original df has been updated:

df[]
   id   admidate    disdate first_adm last_adm intervention  fu_time
1:  1 2010-01-01 2010-01-15         1        0            1 365 days
2:  1 2011-01-01 2011-01-15         0        1            0 365 days
3:  2 2010-02-01 2010-02-15         1        0            0 334 days
4:  2 2011-02-01 2011-02-15         0        0            1 334 days
5:  2 2012-01-01 2012-02-15         0        1            0 334 days
6:  3 2010-01-01 2010-01-15         1        0            0 424 days
7:  3 2010-03-01 2010-03-15         0        0            0 424 days
8:  3 2011-03-01 2011-03-15         0        1            0 424 days

If only the last row is to contain the result for each id and if we want to fully simulate OP's expected result:

setDT(df)[, fu_time := {
  fa <- first(admidate[first_adm == 1])
  la <- last(admidate[last_adm == 1]) 
  fi <- first(admidate[intervention == 1])
  ai <- any(intervention == 1)
  replace(rep("-", .N), .N, la - if (ai) pmax(fa, fi) else fa)
}, by = id]
df[]
   id   admidate    disdate first_adm last_adm intervention fu_time
1:  1 2010-01-01 2010-01-15         1        0            1       -
2:  1 2011-01-01 2011-01-15         0        1            0     365
3:  2 2010-02-01 2010-02-15         1        0            0       -
4:  2 2011-02-01 2011-02-15         0        0            1       -
5:  2 2012-01-01 2012-02-15         0        1            0     334
6:  3 2010-01-01 2010-01-15         1        0            0       -
7:  3 2010-03-01 2010-03-15         0        0            0       -
8:  3 2011-03-01 2011-03-15         0        1            0     424

dplyr version

As the OP has tagged the question with mutate I suppose the OP might prefer a dplyr approach:

library(dplyr)
df %>% 
  group_by(id) %>% 
  summarise(
    fu_time = last(admidate[last_adm == 1]) - 
      if (any(intervention == 1))
        pmax(first(admidate[first_adm == 1]), first(admidate[intervention == 1]))
    else first(admidate[first_adm == 1])
  ) %>% 
  inner_join(df, .)
  id   admidate    disdate first_adm last_adm intervention  fu_time
1  1 2010-01-01 2010-01-15         1        0            1 365 days
2  1 2011-01-01 2011-01-15         0        1            0 365 days
3  2 2010-02-01 2010-02-15         1        0            0 334 days
4  2 2011-02-01 2011-02-15         0        0            1 334 days
5  2 2012-01-01 2012-02-15         0        1            0 334 days
6  3 2010-01-01 2010-01-15         1        0            0 424 days
7  3 2010-03-01 2010-03-15         0        0            0 424 days
8  3 2011-03-01 2011-03-15         0        1            0 424 days

Note that this will not update df but creates a new object.

Caveat

It is implicitely assumed that the dataset is already ordered by ascending admidate within each id group.

Upvotes: 2

Related Questions