Reputation: 57
I have a data frame where each row is a different date and every column is different time series.
The date range in the table is 01.01.2019-01.01.2021.
Some of the time series are relevant for only part of the dates and have missing values on weekends and holidays.
How can I complete the missing values for each time series using previous day values only for the relevant dates of each column (if the time series in a specific column is from 01.03.2019 to 01.09.2019 I want to complete only the missing values in this dates range)?
In addition, if the time series stops for more than 5 days and then continues I want to stop the completion and then to restart the completion again.
I have tried to use the fill function:
data <- data %>%
fill(colnames(data))
but it completes also the missing data after the specific time series is over.
For example the df is:
# Date time_series_1 time_series_2 time_series_3
1 01-01-2019 NA 10 8
2 02-01-2019 5 NA 10
3 03-01-2019 10 NA 20
4 04-01-2019 20 6 40
5 05-01-2019 30 NA NA
6 06-01-2019 NA 8 NA
7 07-01-2019 7 NA NA
8 08-01-2019 5 NA NA
9 09-01-2019 NA NA 5
10 10-01-2019 NA NA NA
11 11-01-2019 NA NA 7
12 12-01-2019 NA NA 10
13 13-01-2019 NA NA 11
14 14-01-2019 NA NA 12
15 15-01-2019 NA NA NA
16 16-01-2019 NA NA 9
17 17-01-2019 NA NA 10
18 18-01-2019 NA NA 10
19 19-01-2019 5 NA 11
20 20-01-2019 NA NA NA
21 21-01-2019 5 NA NA
22 22-01-2019 6 NA NA
The desired output is:
# Date time_series_1 time_series_2 time_series_3
1 01-01-2019 NA 10 8
2 02-01-2019 5 10 10
3 03-01-2019 10 10 20
4 04-01-2019 20 6 40
5 05-01-2019 30 6 40
6 06-01-2019 30 8 40
7 07-01-2019 7 NA 40
8 08-01-2019 5 NA 40
9 09-01-2019 NA NA 5
10 10-01-2019 NA NA 5
11 11-01-2019 NA NA 7
12 12-01-2019 NA NA 10
13 13-01-2019 NA NA 11
14 14-01-2019 NA NA 12
15 15-01-2019 NA NA 12
16 16-01-2019 NA NA 9
17 17-01-2019 NA NA 10
18 18-01-2019 NA NA 10
19 19-01-2019 5 NA 11
20 20-01-2019 5 NA 11
21 21-01-2019 5 NA 11
22 22-01-2019 6 NA 11
Upvotes: 2
Views: 1100
Reputation: 388817
Edit
Thanks to @G. Grothendieck to mention that na.locf0
has maxgap
argument which can handle the 5-day condition directly.
data[-1] <- lapply(data[-1], zoo::na.locf0, maxgap = 5)
data
Earlier Answer
You can write a function with rle
and zoo::na.locf0
to replace NA
only if the length of consecutive NA
is less than equal to 5. Apply this function for multiple columns with lapply
.
conditionally_replace_na <- function(x) {
ifelse(with(rle(is.na(x)), rep(lengths, lengths)) <= 5 & is.na(x),
zoo::na.locf0(x), x)
}
data[-1] <- lapply(data[-1], conditionally_replace_na)
data
# Date time_series_1 time_series_2 time_series_3
#1 01-01-2019 NA 10 8
#2 02-01-2019 5 10 10
#3 03-01-2019 10 10 20
#4 04-01-2019 20 6 40
#5 05-01-2019 30 6 40
#6 06-01-2019 30 8 40
#7 07-01-2019 7 NA 40
#8 08-01-2019 5 NA 40
39 09-01-2019 NA NA 5
#10 10-01-2019 NA NA 5
#11 11-01-2019 NA NA 7
#12 12-01-2019 NA NA 10
#13 13-01-2019 NA NA 11
#14 14-01-2019 NA NA 12
#15 15-01-2019 NA NA 12
#16 16-01-2019 NA NA 9
#17 17-01-2019 NA NA 10
#18 18-01-2019 NA NA 10
#19 19-01-2019 5 NA 11
#20 20-01-2019 5 NA 11
#21 21-01-2019 5 NA 11
#22 22-01-2019 6 NA 11
Function can also be applied with dplyr::across
library(dplyr)
data %>% mutate(across(starts_with('time_series'), conditionally_replace_na))
Upvotes: 4