user14294024
user14294024

Reputation: 13

Is there a way to subset data by keeping the first observation of one condition and the previous observation prior to the onset of a condition?

I have a dataset that contains multiple observations per individual and observations were taken yearly. There are 2 conditions (TRUE or FALSE). Some individuals have multiple observations per condition, others have multiple observations for only 1 condition:

ID = c(1,1,1,2,2,3,3,3,4,4,4,5,5,5,5)
Cond = c(FALSE,TRUE,TRUE,FALSE,FALSE,FALSE,FALSE,TRUE,FALSE,FALSE,FALSE,FALSE,FALSE,TRUE,TRUE)
date = as.Date(c("2017-05-02", "2018-07-04", "2019-06-18","2011-05-05", "2012-08-09", "2015-06-13", "2016-02-08", "2017-07-22", "2014-05-03", "2015-09-05", "2016-04-07", "2016-03-06", "2017-07-23", "2018-09-28", "2019-05-29"))
df = data.frame(ID, Cond, date)

Which then looks like this:

   ID  Cond       date
1   1 FALSE 2017-05-02
2   1  TRUE 2018-07-04
3   1  TRUE 2019-06-18
4   2 FALSE 2011-05-05
5   2 FALSE 2012-08-09
6   3 FALSE 2015-06-13
7   3 FALSE 2016-02-08
8   3  TRUE 2017-07-22
9   4 FALSE 2014-05-03
10  4 FALSE 2015-09-05
11  4 FALSE 2016-04-07
12  5 FALSE 2016-03-06
13  5 FALSE 2017-07-23
14  5  TRUE 2018-09-28
15  5  TRUE 2019-05-29

I want to subset the data so that each individual has one row for the last observation where the condition was FALSE and a second row for the first observation where the condition is TRUE, excluding all individuals who only have observations for one condition. The above example would then look like:

  ID  Cond       date
1  1 FALSE 2017-05-02
2  1  TRUE 2018-07-04
3  3 FALSE 2016-02-08
4  3  TRUE 2017-07-22
5  5 FALSE 2017-07-23
6  5  TRUE 2018-09-28

I'm just not sure on how to do this with a large dataset.

Upvotes: 1

Views: 411

Answers (1)

Ben
Ben

Reputation: 30494

I suspect there is likely a better way, but here is one approach using dplyr. First, group_by each ID and then use slice for obtain the last Cond which is FALSE using tail, and the first Cond which is TRUE using head. Finally, only include results where there are two rows for a given ID.

library(dplyr)

df %>%
  arrange(ID, date) %>%
  group_by(ID) %>%
  slice(c(tail(which(!Cond), 1), head(which(Cond), 1))) %>%
  filter(n() == 2)

Output

     ID Cond  date      
  <dbl> <lgl> <date>    
1     1 FALSE 2017-05-02
2     1 TRUE  2018-07-04
3     3 FALSE 2016-02-08
4     3 TRUE  2017-07-22
5     5 FALSE 2017-07-23
6     5 TRUE  2018-09-28

Upvotes: 1

Related Questions