replacing hourly missing value with yearly average of that hour of the day

Question

I have a hourly dataset from 1996 - 2010 in the following format:

             date         value  
1- - -1996-01-01 00:00:00- - -  NA  
2- - -1996-01-01 01:00:00- - -  38  
3 - - -1996-01-01 02:00:00- - - 44  
4- - -1996-01-01 03:00:00- - -  48  
5- - -1996-01-01 04:00:00- - -  42  
6- - -1996-01-01 05:00:00- - -  44  
7- - - 1996-01-01 06:00:00- - - 38  
8- - - 1996-01-01 07:00:00- - - 42  
9- - -1996-01-01 08:00:00- - -  44  
10- - -1996-01-01 09:00:00- - - 44

I have lot of missing hours data and I am tring to calculate the missing hour values by taking the average of previous and next hour, and if many hours missing I am trying to calculate by taking the average of that hour for every day that year, just wondering if this is possible?

I have tried the following but this gives me average of the complete dataset:

a = c(NA, 1, 2, 3, 10)   
a[which(is.na(a)==TRUE)] = mean(a,na.rm = T)

I would really appreciate any advice about how I should proceed with this calculation.

joran · Accepted Answer

You could probably do this using some handy function from the zoo package. For instance, na.approx, with maxgap = 1 should linearly interpolate all the gaps of length one. Then you'd probably want to use na.aggregate, splitting by year and hour, to fill longer gaps with that periods' mean.

Here's a simple example to give you a sense of how these functions work:

set.seed(124)
tt <- as.POSIXct("2000-01-01 10:00:00") + 3600*c(1:100,10000:10100)
dd <- runif(201)

aa <- data.frame(x1 = tt,x2 = dd)
aa$x2[sample(201,30)] <- NA
aa$x3 <- na.approx(aa$x2,maxgap = 1)
aa$x4 <- na.aggregate(aa$x3,by = format(aa$x1,"%Y-%H"))

Note that if your series has leading or trailing NAs, you might get errors, since the "linear interpolation" piece doesn't make much sense in that case. So you'd have to fill those in some other way.

replacing hourly missing value with yearly average of that hour of the day

Answers (2)

Related Questions