Reputation: 1296
I have a hourly dataset from 1996 - 2010 in the following format:
date value
1- - -1996-01-01 00:00:00- - - NA
2- - -1996-01-01 01:00:00- - - 38
3 - - -1996-01-01 02:00:00- - - 44
4- - -1996-01-01 03:00:00- - - 48
5- - -1996-01-01 04:00:00- - - 42
6- - -1996-01-01 05:00:00- - - 44
7- - - 1996-01-01 06:00:00- - - 38
8- - - 1996-01-01 07:00:00- - - 42
9- - -1996-01-01 08:00:00- - - 44
10- - -1996-01-01 09:00:00- - - 44
I have lot of missing hours data and I am tring to calculate the missing hour values by taking the average of previous and next hour, and if many hours missing I am trying to calculate by taking the average of that hour for every day that year, just wondering if this is possible?
I have tried the following but this gives me average of the complete dataset:
a = c(NA, 1, 2, 3, 10)
a[which(is.na(a)==TRUE)] = mean(a,na.rm = T)
I would really appreciate any advice about how I should proceed with this calculation.
Upvotes: 0
Views: 280
Reputation: 269471
na.aggregate
in zoo does that. Its only one line of code to fill in the missing values:
# read in the data
Lines <- "1996-01-01 00:00:00 NA
1996-01-01 01:00:00 38
1996-01-01 02:00:00 43
1997-01-01 00:00:00 44
1997-01-01 01:00:00 45"
library(zoo)
library(chron)
z <- read.zoo(text = Lines, index = 1:2, FUN = paste, FUN2 = as.chron)
# fill in the missing values
na.aggregate(z, hours, FUN = mean)
Upvotes: 2
Reputation: 173527
You could probably do this using some handy function from the zoo package. For instance, na.approx
, with maxgap = 1
should linearly interpolate all the gaps of length one. Then you'd probably want to use na.aggregate
, splitting by year and hour, to fill longer gaps with that periods' mean.
Here's a simple example to give you a sense of how these functions work:
set.seed(124)
tt <- as.POSIXct("2000-01-01 10:00:00") + 3600*c(1:100,10000:10100)
dd <- runif(201)
aa <- data.frame(x1 = tt,x2 = dd)
aa$x2[sample(201,30)] <- NA
aa$x3 <- na.approx(aa$x2,maxgap = 1)
aa$x4 <- na.aggregate(aa$x3,by = format(aa$x1,"%Y-%H"))
Note that if your series has leading or trailing NA
s, you might get errors, since the "linear interpolation" piece doesn't make much sense in that case. So you'd have to fill those in some other way.
Upvotes: 1