john22
john22

Reputation: 57

Fill missing values in time series using previous day data - R

I have a data frame where each row is a different date and every column is different time series.
The date range in the table is 01.01.2019-01.01.2021.
Some of the time series are relevant for only part of the dates and have missing values on weekends and holidays.

How can I complete the missing values for each time series using previous day values only for the relevant dates of each column (if the time series in a specific column is from 01.03.2019 to 01.09.2019 I want to complete only the missing values in this dates range)?
In addition, if the time series stops for more than 5 days and then continues I want to stop the completion and then to restart the completion again.

I have tried to use the fill function:

data <- data %>%  
fill(colnames(data))  

but it completes also the missing data after the specific time series is over.

For example the df is:

#  Date         time_series_1           time_series_2            time_series_3
1  01-01-2019               NA                      10                       8
2  02-01-2019               5                       NA                       10
3  03-01-2019               10                      NA                       20   
4  04-01-2019               20                      6                        40
5  05-01-2019               30                      NA                       NA
6  06-01-2019               NA                      8                        NA
7  07-01-2019               7                       NA                       NA
8  08-01-2019               5                       NA                       NA
9  09-01-2019               NA                      NA                       5
10 10-01-2019               NA                      NA                       NA
11 11-01-2019               NA                      NA                       7
12 12-01-2019               NA                      NA                       10
13 13-01-2019               NA                      NA                       11
14 14-01-2019               NA                      NA                       12
15 15-01-2019               NA                      NA                       NA
16 16-01-2019               NA                      NA                       9
17 17-01-2019               NA                      NA                       10
18 18-01-2019               NA                      NA                       10
19 19-01-2019               5                       NA                       11
20 20-01-2019               NA                      NA                       NA
21 21-01-2019               5                       NA                       NA
22 22-01-2019               6                       NA                       NA

The desired output is:

#  Date         time_series_1           time_series_2            time_series_3
1  01-01-2019               NA                      10                       8
2  02-01-2019               5                       10                       10
3  03-01-2019               10                      10                       20   
4  04-01-2019               20                      6                        40
5  05-01-2019               30                      6                        40
6  06-01-2019               30                      8                        40
7  07-01-2019               7                       NA                       40
8  08-01-2019               5                       NA                       40
9  09-01-2019               NA                      NA                       5
10 10-01-2019               NA                      NA                       5
11 11-01-2019               NA                      NA                       7
12 12-01-2019               NA                      NA                       10
13 13-01-2019               NA                      NA                       11
14 14-01-2019               NA                      NA                       12
15 15-01-2019               NA                      NA                       12
16 16-01-2019               NA                      NA                       9
17 17-01-2019               NA                      NA                       10
18 18-01-2019               NA                      NA                       10
19 19-01-2019               5                       NA                       11
20 20-01-2019               5                       NA                       11
21 21-01-2019               5                       NA                       11
22 22-01-2019               6                       NA                       11

Upvotes: 2

Views: 1100

Answers (1)

Ronak Shah
Ronak Shah

Reputation: 388817

Edit

Thanks to @G. Grothendieck to mention that na.locf0 has maxgap argument which can handle the 5-day condition directly.

data[-1] <- lapply(data[-1], zoo::na.locf0, maxgap = 5)
data

Earlier Answer

You can write a function with rle and zoo::na.locf0 to replace NA only if the length of consecutive NA is less than equal to 5. Apply this function for multiple columns with lapply.

conditionally_replace_na <- function(x) {
  ifelse(with(rle(is.na(x)), rep(lengths, lengths)) <= 5 & is.na(x), 
               zoo::na.locf0(x), x)  
}

data[-1] <- lapply(data[-1], conditionally_replace_na)
data

#         Date time_series_1 time_series_2 time_series_3
#1  01-01-2019            NA            10             8
#2  02-01-2019             5            10            10
#3  03-01-2019            10            10            20
#4  04-01-2019            20             6            40
#5  05-01-2019            30             6            40
#6  06-01-2019            30             8            40
#7  07-01-2019             7            NA            40
#8  08-01-2019             5            NA            40
39  09-01-2019            NA            NA             5
#10 10-01-2019            NA            NA             5
#11 11-01-2019            NA            NA             7
#12 12-01-2019            NA            NA            10
#13 13-01-2019            NA            NA            11
#14 14-01-2019            NA            NA            12
#15 15-01-2019            NA            NA            12
#16 16-01-2019            NA            NA             9
#17 17-01-2019            NA            NA            10
#18 18-01-2019            NA            NA            10
#19 19-01-2019             5            NA            11
#20 20-01-2019             5            NA            11
#21 21-01-2019             5            NA            11
#22 22-01-2019             6            NA            11

Function can also be applied with dplyr::across

library(dplyr)
data %>% mutate(across(starts_with('time_series'), conditionally_replace_na))

Upvotes: 4

Related Questions