Eric Green
Eric Green

Reputation: 7725

replacing missing values by group conditional on date/time

I want to group by person and date1 and fill in missing data for date2 and indicator by person and day IF with the person's next observation occurs in the same day.

For instance, person 1 is missing date2 and indicator values for the second and third observations. As shown below, I want to replace these missing values with the next non-NA observation in the same day for this person: date2==2018-02-02 15:04:00 and indicator==1.

Note that for person 2, the last NA does not have a next observation in the same day, so it needs to remain NA.

Here is the data frame I have:

  person               date1               date2 indicator
1      1 2018-02-02 12:00:00 2018-02-02 12:05:00         1
2      1 2018-02-02 13:00:00                <NA>        NA
3      1 2018-02-02 14:00:00                <NA>        NA
4      1 2018-02-02 15:00:00 2018-02-02 15:04:00         1
5      2 2018-02-01 12:00:00                <NA>        NA
6      2 2018-02-01 13:00:00 2018-02-01 13:06:00         1
7      2 2018-02-02 12:00:00 2018-02-02 12:03:00         1
8      2 2018-02-03 12:00:00                <NA>        NA

Here is the data frame I want:

  person               date1               date2 indicator
1      1 2018-02-02 12:00:00 2018-02-02 12:05:00         1
2      1 2018-02-02 13:00:00 2018-02-02 15:04:00         1
3      1 2018-02-02 14:00:00 2018-02-02 15:04:00         1
4      1 2018-02-02 15:00:00 2018-02-02 15:04:00         1
5      2 2018-02-01 12:00:00 2018-02-01 13:06:00         1
6      2 2018-02-01 13:00:00 2018-02-01 13:06:00         1
7      2 2018-02-02 12:00:00 2018-02-02 12:03:00         1
8      2 2018-02-03 12:00:00                <NA>        NA

Example:

library(tidyverse)
df.have <- data.frame(person=c(1, 1, 1, 1, 2, 2, 2, 2),
                      date1=ymd_hms(c("2018-02-02 12:00:00", 
                                      "2018-02-02 13:00:00", 
                                      "2018-02-02 14:00:00", 
                                      "2018-02-02 15:00:00",
                                      "2018-02-01 12:00:00", 
                                      "2018-02-01 13:00:00", 
                                      "2018-02-02 12:00:00", 
                                      "2018-02-03 12:00:00")),
                       date2=ymd_hms(c("2018-02-02 12:05:00", 
                                       NA, 
                                       NA, 
                                       "2018-02-02 15:04:00",
                                       NA, 
                                       "2018-02-01 13:06:00", 
                                       "2018-02-02 12:03:00", 
                                       NA)),
                       indicator=c(1, NA, NA, 1,
                                   NA, 1, 1, NA))

df.want <- data.frame(person=c(1, 1, 1, 1, 2, 2, 2, 2),
                      date1=ymd_hms(c("2018-02-02 12:00:00", 
                                      "2018-02-02 13:00:00", 
                                      "2018-02-02 14:00:00", 
                                      "2018-02-02 15:00:00",
                                      "2018-02-01 12:00:00", 
                                      "2018-02-01 13:00:00", 
                                      "2018-02-02 12:00:00", 
                                      "2018-02-03 12:00:00")),
                      date2=ymd_hms(c("2018-02-02 12:05:00", 
                                      "2018-02-02 15:04:00",
                                      "2018-02-02 15:04:00", 
                                      "2018-02-02 15:04:00",
                                      "2018-02-01 13:06:00", 
                                      "2018-02-01 13:06:00", 
                                      "2018-02-02 12:03:00", 
                                      NA)),
                      indicator=c(1, 1, 1, 1,
                                  1, 1, 1, NA))

I can filter down to some of the replacement values, but still a good bit from where I want to get.

df.have %>%
  group_by(person, date(date1)) %>%
  arrange(person, date1) %>%
  filter(row_number() %in% c(n()))

Upvotes: 1

Views: 126

Answers (1)

Andrew Gustar
Andrew Gustar

Reputation: 18425

You can do it like this (note that you also need lubridate as well as the tidyverse packages)...

df.want <- df.have %>% mutate(day=date(date1)) %>% #add a date variable for grouping
                       group_by(day,person) %>% 
                       fill(date2,indicator,.direction = "up") %>% #use tidyr 'fill' to remove NAs
                       ungroup() %>% 
                       select(-day) %>% #remove grouping variable
                       arrange(person,date1) #restore original order

df.want

# A tibble: 8 x 4
  person               date1               date2 indicator
   <dbl>              <dttm>              <dttm>     <dbl>
1      1 2018-02-02 12:00:00 2018-02-02 12:05:00         1
2      1 2018-02-02 13:00:00 2018-02-02 15:04:00         1
3      1 2018-02-02 14:00:00 2018-02-02 15:04:00         1
4      1 2018-02-02 15:00:00 2018-02-02 15:04:00         1
5      2 2018-02-01 12:00:00 2018-02-01 13:06:00         1
6      2 2018-02-01 13:00:00 2018-02-01 13:06:00         1
7      2 2018-02-02 12:00:00 2018-02-02 12:03:00         1
8      2 2018-02-03 12:00:00                  NA        NA

Upvotes: 2

Related Questions