Achak
Achak

Reputation: 1296

replacing hourly missing value with yearly average of that hour of the day

I have a hourly dataset from 1996 - 2010 in the following format:

             date         value  
1- - -1996-01-01 00:00:00- - -  NA  
2- - -1996-01-01 01:00:00- - -  38  
3 - - -1996-01-01 02:00:00- - - 44  
4- - -1996-01-01 03:00:00- - -  48  
5- - -1996-01-01 04:00:00- - -  42  
6- - -1996-01-01 05:00:00- - -  44  
7- - - 1996-01-01 06:00:00- - - 38  
8- - - 1996-01-01 07:00:00- - - 42  
9- - -1996-01-01 08:00:00- - -  44  
10- - -1996-01-01 09:00:00- - - 44  

I have lot of missing hours data and I am tring to calculate the missing hour values by taking the average of previous and next hour, and if many hours missing I am trying to calculate by taking the average of that hour for every day that year, just wondering if this is possible?

I have tried the following but this gives me average of the complete dataset:

a = c(NA, 1, 2, 3, 10)   
a[which(is.na(a)==TRUE)] = mean(a,na.rm = T)

I would really appreciate any advice about how I should proceed with this calculation.

Upvotes: 0

Views: 280

Answers (2)

G. Grothendieck
G. Grothendieck

Reputation: 269471

na.aggregate in zoo does that. Its only one line of code to fill in the missing values:

# read in the data

Lines <- "1996-01-01 00:00:00 NA  
1996-01-01 01:00:00 38  
1996-01-01 02:00:00 43
1997-01-01 00:00:00 44  
1997-01-01 01:00:00 45"

library(zoo)
library(chron)
z <- read.zoo(text = Lines, index = 1:2, FUN = paste, FUN2 = as.chron)

# fill in the missing values

na.aggregate(z, hours, FUN = mean)

Upvotes: 2

joran
joran

Reputation: 173527

You could probably do this using some handy function from the zoo package. For instance, na.approx, with maxgap = 1 should linearly interpolate all the gaps of length one. Then you'd probably want to use na.aggregate, splitting by year and hour, to fill longer gaps with that periods' mean.

Here's a simple example to give you a sense of how these functions work:

set.seed(124)
tt <- as.POSIXct("2000-01-01 10:00:00") + 3600*c(1:100,10000:10100)
dd <- runif(201)

aa <- data.frame(x1 = tt,x2 = dd)
aa$x2[sample(201,30)] <- NA
aa$x3 <- na.approx(aa$x2,maxgap = 1)
aa$x4 <- na.aggregate(aa$x3,by = format(aa$x1,"%Y-%H"))

Note that if your series has leading or trailing NAs, you might get errors, since the "linear interpolation" piece doesn't make much sense in that case. So you'd have to fill those in some other way.

Upvotes: 1

Related Questions