stasSajin
stasSajin

Reputation: 144

Expand by ID for future periods only

Is there a way to fill in for implicit missingness for future dates based on id?

For example, imagine a experiment that starts in Jan-2016. I have 3 participants that join in at different periods. Subject 1 joins me in Jan and continues to stay until Aug. Subj 2 joins me in March, and stays in the experiment until August. Subject 3 also joins me in March, but drops out sometime in in May, so no observations are recorded for periods May-Aug.

The question is, how do I fill in the dates when subject 3 dropped out of the experiment? Here is some mock data for how things look like:

   Subject   Date
1        1 Jan-16
2        1 Feb-16
3        1 Mar-16
4        1 Apr-16
5        1 May-16
6        1 Jun-16
7        1 Jul-16
8        1 Aug-16
9        2 Mar-16
10       2 Apr-16
11       2 May-16
12       2 Jun-16
13       2 Jul-16
14       2 Aug-16
15       3 Mar-16
16       3 Apr-16




structure(list(Subject = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 
2L, 2L, 2L, 2L, 2L, 3L, 3L), Date = structure(c(5L, 4L, 8L, 2L, 
9L, 7L, 6L, 3L, 8L, 2L, 9L, 7L, 6L, 3L, 8L, 2L), .Label = c("", 
"Apr-16", "Aug-16", "Feb-16", "Jan-16", "Jul-16", "Jun-16", "Mar-16", 
"May-16"), class = "factor")), class = "data.frame", row.names = c(NA, 
-16L), .Names = c("Subject", "Date"))

And here is how the data should look like:

   Subject   Date
1        1 Jan-16
2        1 Feb-16
3        1 Mar-16
4        1 Apr-16
5        1 May-16
6        1 Jun-16
7        1 Jul-16
8        1 Aug-16
9        2 Mar-16
10       2 Apr-16
11       2 May-16
12       2 Jun-16
13       2 Jul-16
14       2 Aug-16
15       3 Mar-16
16       3 Apr-16
17       3 May-16
18       3 Jun-16
19       3 Jul-16
20       3 Aug-16

structure(list(Subject = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 
2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L), Date = structure(c(4L, 
3L, 7L, 1L, 8L, 6L, 5L, 2L, 7L, 1L, 8L, 6L, 5L, 2L, 7L, 1L, 8L, 
6L, 5L, 2L), .Label = c("Apr-16", "Aug-16", "Feb-16", "Jan-16", 
"Jul-16", "Jun-16", "Mar-16", "May-16"), class = "factor")), class = "data.frame", row.names = c(NA, 
-20L), .Names = c("Subject", "Date"))

I tried using expand from tidyr and TimeFill from DataCombine package, but the issue with those approaches is that I would get dates for periods before a participant joined an experiment. In this particular instance, I only want the periods to be filled for cases when a participant drops out of an experiment.

Upvotes: 1

Views: 54

Answers (1)

Axeman
Axeman

Reputation: 35307

The complete function from tidyr is designed for turning implicit missing values into explicit missing values. We will have to do some filtering to not include past completion. The easiest way seems to be to do a join on a table with starting values:

library(dplyr)
library(tidyr)

df <- df %>% 
  filter(Date != '') %>% 
  droplevels() %>% 
  group_by(Subject)

df2 <- summarise(df, start = first(Date))

df %>%
  complete(Subject, Date) %>%
  left_join(df2) %>% 
  mutate(Date2 = as.Date(paste0('01-', Date), format = '%d-%b-%y'),
         start = as.Date(paste0('01-', start), format = '%d-%b-%y')) %>% 
  filter(Date2 >= start) %>%
  arrange(Subject, Date2) %>% 
  select(-start, -Date2)

Result:

Source: local data frame [20 x 2]
Groups: Subject [3]

   Subject   Date
     <int> <fctr>
1        1 Jan-16
2        1 Feb-16
3        1 Mar-16
4        1 Apr-16
5        1 May-16
6        1 Jun-16
7        1 Jul-16
8        1 Aug-16
9        2 Mar-16
10       2 Apr-16
11       2 May-16
12       2 Jun-16
13       2 Jul-16
14       2 Aug-16
15       3 Mar-16
16       3 Apr-16
17       3 May-16
18       3 Jun-16
19       3 Jul-16
20       3 Aug-16

I use date conversion to do a reliable comparison with the starting date, but you might be able to use the row_numbers somehow. A problem is that complete will rearrange the data.

p.s. Note that your example dput has an empty factor level (""), so I filter that out first.

Upvotes: 1

Related Questions