Reputation: 1459
I'm working with a dataset of weather variables (temperature, precipitation, etc.) that has a few missing values. Because of my specific approach (summing these variables across several days), I need to address NA values in the dataset.
When there is a missing daily value, I'd like to fill that day with a mean value of the previous and following day. The assumption here is that weather values are similar from one day to the next. And yes, I realize this is a big assumption.
I've developed the following:
library(dplyr)

maxTemp <- c(13.2, 10.7, NA, 17.9, 6.6, 10, 13, NA, NA, 8.8, 9.9, 14.9, 16.3, NA, 18, 9.9, 11.5, 15.3, 21.7, 23.9, 26.6, 27, 22.3, NA, 17.9)
weather <- as.data.frame(maxTemp)

# Replace each NA with the mean of the previous and following day
weather %>%
  mutate(maxTempNA = if_else(is.na(maxTemp),
                             (lag(maxTemp) + lead(maxTemp)) / 2,
                             maxTemp))
However, in a few cases, I have two NA values on consecutive days, so this doesn't work. Any thoughts on approaches to code this so that when there are two (or more) NA's in a row, the average uses the 'bookending' values to fill the NAs?
The final result would look like this:
maxTemp <- c(13.2, 10.7, 14.3, 17.9, 6.6, 10, 13, 10.9, 10.9, 8.8, 9.9, 14.9, 16.3, 17.15, 18, 9.9, 11.5, 15.3, 21.7, 23.9, 26.6, 27, 22.3, 20.1, 17.9)
Upvotes: 2
Views: 220
Reputation: 50668
How about using approx() to replace NAs with interpolated values? By default, approx() uses linear interpolation, so this should match your manual replace-by-mean results wherever a single NA sits between two known values.
weather %>%
mutate(maxTemp_interp = approx(1:n(), maxTemp, 1:n())$y)
# maxTemp maxTemp_interp
# 1 13.2 13.20
# 2 10.7 10.70
# 3 NA 14.30
# 4 17.9 17.90
# 5 6.6 6.60
# 6 10.0 10.00
# 7 13.0 13.00
# 8 NA 11.60
# 9 NA 10.20
# 10 8.8 8.80
# 11 9.9 9.90
# 12 14.9 14.90
# 13 16.3 16.30
# 14 NA 17.15
# 15 18.0 18.00
# 16 9.9 9.90
# 17 11.5 11.50
# 18 15.3 15.30
# 19 21.7 21.70
# 20 23.9 23.90
# 21 26.6 26.60
# 22 27.0 27.00
# 23 22.3 22.30
# 24 NA 20.10
# 25 17.9 17.90
I've created a new column here to make it easier to compare with the original data.
Markus pointed out in the comments (thanks @markus) that to reproduce your expected output, you'd actually need method = "constant" with f = 0.5:
weather %>%
mutate(maxTemp_interp = approx(1:n(), maxTemp, 1:n(), method = "constant", f = 0.5)$y)
# maxTemp maxTemp_interp
# 1 13.2 13.20
# 2 10.7 10.70
# 3 NA 14.30
# 4 17.9 17.90
# 5 6.6 6.60
# 6 10.0 10.00
# 7 13.0 13.00
# 8 NA 10.90
# 9 NA 10.90
# 10 8.8 8.80
# 11 9.9 9.90
# 12 14.9 14.90
# 13 16.3 16.30
# 14 NA 17.15
# 15 18.0 18.00
# 16 9.9 9.90
# 17 11.5 11.50
# 18 15.3 15.30
# 19 21.7 21.70
# 20 23.9 23.90
# 21 26.6 26.60
# 22 27.0 27.00
# 23 22.3 22.30
# 24 NA 20.10
# 25 17.9 17.90
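As a quick sanity check, here is a minimal base R sketch that reruns the constant-interpolation call (using seq_along() in place of n(), so dplyr is not needed) and compares the result with the expected vector from the question. The names idx, filled and expected are mine, not from the original posts.

```r
# Data from the question
maxTemp <- c(13.2, 10.7, NA, 17.9, 6.6, 10, 13, NA, NA, 8.8, 9.9, 14.9, 16.3,
             NA, 18, 9.9, 11.5, 15.3, 21.7, 23.9, 26.6, 27, 22.3, NA, 17.9)

# Expected result from the question
expected <- c(13.2, 10.7, 14.3, 17.9, 6.6, 10, 13, 10.9, 10.9, 8.8, 9.9, 14.9,
              16.3, 17.15, 18, 9.9, 11.5, 15.3, 21.7, 23.9, 26.6, 27, 22.3,
              20.1, 17.9)

idx <- seq_along(maxTemp)

# approx() drops the NA pairs before interpolating; with method = "constant"
# and f = 0.5, every point inside a gap gets the midpoint of the two
# surrounding known values
filled <- approx(idx, maxTemp, xout = idx, method = "constant", f = 0.5)$y

# Compare with a numeric tolerance rather than exact equality
isTRUE(all.equal(filled, expected))
```

Note that with the default rule = 1, approx() returns NA for points outside the range of known values, so an NA on the very first or last day would stay NA.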
Upvotes: 3
Reputation: 60060
If you want to use the mean of the most recent non-NA value going backwards and forwards, you can use something like data.table::nafill() to fill values both down and up, and then take the mean:
weather$prevTemp = data.table::nafill(weather$maxTemp, type = "locf")
weather$nextTemp = data.table::nafill(weather$maxTemp, type = "nocb")
weather$maxTemp[is.na(weather$maxTemp)] = ((weather$prevTemp + weather$nextTemp) / 2)[is.na(weather$maxTemp)]
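If data.table isn't available, the same fill-down, fill-up, then average idea can be sketched in base R. This is only an illustration under my own assumptions: fill_forward() is a hypothetical helper, not part of any package, and the result goes into a new vector (maxTempFilled) instead of overwriting the original.

```r
# Data from the question
maxTemp <- c(13.2, 10.7, NA, 17.9, 6.6, 10, 13, NA, NA, 8.8, 9.9, 14.9, 16.3,
             NA, 18, 9.9, 11.5, 15.3, 21.7, 23.9, 26.6, 27, 22.3, NA, 17.9)

# Hypothetical helper: carry the last non-NA value forward.
# Leading NAs have nothing to carry, so they stay NA.
fill_forward <- function(x) {
  # Index of the most recent non-NA element so far (0 if there is none yet)
  last_seen <- cummax(seq_along(x) * (!is.na(x)))
  last_seen[last_seen == 0] <- NA
  x[last_seen]
}

prevTemp <- fill_forward(maxTemp)            # last observation carried forward
nextTemp <- rev(fill_forward(rev(maxTemp)))  # next observation carried backward

# Average the two bookending values wherever the original is NA
maxTempFilled <- ifelse(is.na(maxTemp), (prevTemp + nextTemp) / 2, maxTemp)
```

As with the other approaches, an NA at the very start or end of the series has only one neighbouring value, so it remains NA here.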
Upvotes: 3