tnt

Reputation: 1459

fill NA values with mean of preceding and subsequent values

I'm working with a dataset of weather variables (temperature, precipitation, etc.) that has a few missing values. Because of my specific approach (summing these variables across several days), I need to address NA values in the dataset.

When there is a missing daily value, I'd like to fill that day with a mean value of the previous and following day. The assumption here is that weather values are similar from one day to the next. And yes, I realize this is a big assumption.

I've developed the following:

library(dplyr)  # provides %>%, mutate(), if_else(), lag(), lead()

maxTemp <- c(13.2, 10.7, NA, 17.9, 6.6, 10, 13, NA, NA, 8.8, 9.9, 14.9, 16.3, NA, 18, 9.9, 11.5, 15.3, 21.7, 23.9, 26.6, 27, 22.3, NA, 17.9)
weather <- as.data.frame(maxTemp)
weather %>% 
  mutate(maxTempNA = if_else(is.na(maxTemp),
                             (lag(maxTemp) + lead(maxTemp))/2,
                             maxTemp))

However, in a few cases I have two NA values on consecutive days. There, either lag() or lead() is itself NA, so the mean is still NA and the gap isn't filled. Any thoughts on how to code this so that when there are two (or more) NAs in a row, the average uses the 'bookending' values to fill the NAs?

The final result would look like this:

maxTemp <- c(13.2, 10.7, 14.3, 17.9, 6.6, 10, 13, 10.9, 10.9, 8.8, 9.9, 14.9, 16.3, 17.15, 18, 9.9, 11.5, 15.3, 21.7, 23.9, 26.6, 27, 22.3, 20.1, 17.9)

Upvotes: 2

Views: 220

Answers (2)

Maurits Evers

Reputation: 50668

How about using approx() to replace NAs with interpolated values? By default, approx() uses linear interpolation, so for an isolated NA this matches your manual replace-by-mean result.

weather %>%
    mutate(maxTemp_interp = approx(1:n(), maxTemp, 1:n())$y)
#    maxTemp maxTemp_interp
# 1     13.2          13.20
# 2     10.7          10.70
# 3       NA          14.30
# 4     17.9          17.90
# 5      6.6           6.60
# 6     10.0          10.00
# 7     13.0          13.00
# 8       NA          11.60
# 9       NA          10.20
# 10     8.8           8.80
# 11     9.9           9.90
# 12    14.9          14.90
# 13    16.3          16.30
# 14      NA          17.15
# 15    18.0          18.00
# 16     9.9           9.90
# 17    11.5          11.50
# 18    15.3          15.30
# 19    21.7          21.70
# 20    23.9          23.90
# 21    26.6          26.60
# 22    27.0          27.00
# 23    22.3          22.30
# 24      NA          20.10
# 25    17.9          17.90

I've created a new column here to make it easier to compare with the original data.


Update

Markus pointed out in the comments (thanks @markus) that to reproduce your expected output, you'd actually need method = "constant" with f = 0.5:

weather %>%
    mutate(maxTemp_interp = approx(1:n(), maxTemp, 1:n(), method = "constant", f = 0.5)$y)
#    maxTemp maxTemp_interp
# 1     13.2          13.20
# 2     10.7          10.70
# 3       NA          14.30
# 4     17.9          17.90
# 5      6.6           6.60
# 6     10.0          10.00
# 7     13.0          13.00
# 8       NA          10.90
# 9       NA          10.90
# 10     8.8           8.80
# 11     9.9           9.90
# 12    14.9          14.90
# 13    16.3          16.30
# 14      NA          17.15
# 15    18.0          18.00
# 16     9.9           9.90
# 17    11.5          11.50
# 18    15.3          15.30
# 19    21.7          21.70
# 20    23.9          23.90
# 21    26.6          26.60
# 22    27.0          27.00
# 23    22.3          22.30
# 24      NA          20.10
# 25    17.9          17.90
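
To see the difference between the two calls without dplyr, here is a minimal base-R sketch on the same vector. The names idx, linear, bookend and gap are just illustrative. With method = "constant", f = 0.5 weights the left and right neighbours equally, which is what gives the flat 'bookend' mean across a run of NAs.

```r
# Same data as in the question
maxTemp <- c(13.2, 10.7, NA, 17.9, 6.6, 10, 13, NA, NA, 8.8, 9.9, 14.9, 16.3,
             NA, 18, 9.9, 11.5, 15.3, 21.7, 23.9, 26.6, 27, 22.3, NA, 17.9)
idx <- seq_along(maxTemp)

# Default: linear interpolation between the nearest non-NA neighbours
linear <- approx(idx, maxTemp, xout = idx)$y

# Step interpolation; f = 0.5 averages the left and right neighbour
bookend <- approx(idx, maxTemp, xout = idx, method = "constant", f = 0.5)$y

gap <- 8:9      # the two consecutive NAs
linear[gap]     # 11.6 10.2
bookend[gap]    # 10.9 10.9
```

Note that with the default rule = 1, an NA at the very start or end of the series has no value on one side and stays NA.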

Upvotes: 3

Marius

Reputation: 60060

If you want to use the mean of the most recent non-NA value going backwards and forwards, you can use something like data.table::nafill() to fill values both down and up, and then take the mean:

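# Last observation carried forward: most recent non-NA value before each row
weather$prevTemp <- data.table::nafill(weather$maxTemp, type = "locf")
# Next observation carried backward: next non-NA value after each row
weather$nextTemp <- data.table::nafill(weather$maxTemp, type = "nocb")
# Overwrite only the NA rows with the mean of the two bookend values
weather$maxTemp[is.na(weather$maxTemp)] <- ((weather$prevTemp + weather$nextTemp) / 2)[is.na(weather$maxTemp)]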
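
If you'd rather not keep the helper columns around, the same idea could be wrapped in a small function. This is only a sketch: fill_bookend_mean is a name I made up, not something data.table provides, and it assumes data.table is installed and that x is a numeric vector.

```r
# Hypothetical helper (not part of data.table): fill each NA with the mean of
# the nearest non-NA value before it and the nearest non-NA value after it.
fill_bookend_mean <- function(x) {
  prev_val <- data.table::nafill(x, type = "locf")  # fill downwards
  next_val <- data.table::nafill(x, type = "nocb")  # fill upwards
  is_gap <- is.na(x)
  x[is_gap] <- (prev_val[is_gap] + next_val[is_gap]) / 2
  x
}

maxTemp <- c(13.2, 10.7, NA, 17.9, 6.6, 10, 13, NA, NA, 8.8, 9.9, 14.9, 16.3,
             NA, 18, 9.9, 11.5, 15.3, 21.7, 23.9, 26.6, 27, 22.3, NA, 17.9)

filled <- fill_bookend_mean(maxTemp)
filled[c(3, 8, 9, 14, 24)]  # 14.30 10.90 10.90 17.15 20.10
```

An NA at the very start or end of the vector has no value on one side, so it stays NA with this approach.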

Upvotes: 3
