Deduplicate observations based on window of time

Question

I have data on a large number of individuals and there may be multiple observations per person. I want to deduplicate the data into 'episodes' of 28 days for each individual. I want to drop those records where the date of the observation is 28 days or less than the date of the start of the prior episode.

Some sample data on 6 observations of a single individual are below. The duplicate and new_episode variables are dummy variables and are not present in the original data and indicate the logic of the example.

dat <- data.frame(id = rep(1, 6), spec_n = seq(1,6,1), 
                  spec_date = as.Date(c("2016/01/01", "2016/01/02", "2016/01/30",
                                        "2016/01/31", "2016/02/02", "2016/02/28")),
                  duplicate = c(0,1,0,1,1,0), new_episode = c(1,0,1,0,0,1),
                   stringsAsFactors = FALSE)
dat
  id spec_n  spec_date duplicate new_episode
1  1      1 2016-01-01         0           1
2  1      2 2016-01-02         1           0
3  1      3 2016-01-30         0           1
4  1      4 2016-01-31         1           0
5  1      5 2016-02-02         1           0
6  1      6 2016-02-28         0           1

With dplyr I can calculate the time since the last observation and the time since the first episode. So deduplicating on date_diff would not provide the data I require.

library(dplyr)
dat <- dat %>% group_by(id) %>% 
  mutate(date_diff = spec_date - lag(spec_date),
         earliest_spec_date = min(spec_date), 
         diff_earliest = spec_date - earliest_spec_date)
dat
     id spec_n  spec_date duplicate new_episode date_diff earliest_spec_date diff_earliest
                                              
1     1      1 2016-01-01         0           1   NA days         2016-01-01        0 days
2     1      2 2016-01-02         1           0    1 days         2016-01-01        1 days
3     1      3 2016-01-30         0           1   28 days         2016-01-01       29 days
4     1      4 2016-01-31         1           0    1 days         2016-01-01       30 days
5     1      5 2016-02-02         1           0    2 days         2016-01-01       32 days
6     1      6 2016-02-28         0           1   26 days         2016-01-01       58 days

However, this does not quite provide what I need. spec_n == 6 is less than 28 days since the previous observation, but more than 28 days since the start of the last episode (spec_n == 3).

Expected output would be those rows where duplicate is 0 or new_episode is 1, e.g.

     id spec_n  spec_date duplicate new_episode date_diff earliest_spec_date diff_earliest
                                              
1     1      1 2016-01-01         0           1   NA days         2016-01-01        0 days
2     1      3 2016-01-30         0           1   28 days         2016-01-01       29 days
3     1      6 2016-02-28         0           1   26 days         2016-01-01       58 days

gfgm · Accepted Answer

This should work (its an implementation of the idea Llopis suggested I think).

I make some simulated data first:

df <- data.frame(date = seq(as.Date("2015-01-01"), as.Date("2015-12-31"), by=1), data=rnorm(365))
head(df)
        date       data
1 2015-01-01 -1.4493544
2 2015-01-02 -0.8860342
3 2015-01-03  1.3629541
4 2015-01-04 -2.0131108
5 2015-01-05 -0.4527413
6 2015-01-06  0.8428585

Now we write a function that takes the first date and checks if subsequent dates are more than 28 days distant from it, returning 0 if they are not and 1 if they are. If a date is 28 days away it takes that new date as the basis of future comparisons.

dupFinder <- function(x) {
  n <- 1
  N <- length(x)
  res <- rep(1, N)
  start <- x[n]
  while (n < (N)) {
    if (as.numeric(x[n+1]-start)>=28) {
      res[n+1] <- 1
      n <- n+1
      start <- x[n]
    }
    else {
      res[n+1] <- 0
      n <- n+1
    }
  }
  return(res)
}

The function dupFinder will return a vector of length equal to that of your dataframe, and you can then use it to subset the dataframe to the rows of interest. Thus:

df[dupFinder(df$date)==1,]
          date       data
1   2015-01-01 -1.4493544
29  2015-01-29  0.2084123
57  2015-02-26  1.4541566
85  2015-03-26  0.6794230
113 2015-04-23 -0.8285670
141 2015-05-21 -0.8686872
169 2015-06-18  2.1657994
197 2015-07-16 -1.1802231
225 2015-08-13  0.1808395
253 2015-09-10 -0.4762835
281 2015-10-08 -0.3769593
309 2015-11-05  0.2825544
337 2015-12-03 -0.7132649
365 2015-12-31 -1.8111226

As expected we start with the January 1, then January 29, then Feb 26, since Feb has 28 days we next get March 26th, etc.

Deduplicate observations based on window of time

Answers (1)

Related Questions