watINwatOUT
watINwatOUT

Reputation: 25

R Issue adding NA values for missing rows using data frames

Thanks in advance for any help that is provided.

Long story short: I am working with hourly time series data from a measurement device (exported from SQL then imported in to R in order to properly format the date time ) - the time series contains missing data, sometimes in groups, and I need to locate these missing rows/indices and insert a new row for each instance that holds an NA value.

Related Questions that did not solve my problem:

how to insert missing observations on a data frame

Adding row to a data frame with missing values

Problem Data

The dataset that I am working with in this case is fairly large and varies depending on the measurement device I select. As a test case, I have one time series that contains 17469 hourly observations. I located a small section of the dataset that may be used for testing purposes. Here it is:

> snip
                   date Reading
408 2015-12-15 00:00:00    4.40
409 2015-12-14 23:00:00    4.62
410 2015-12-14 22:00:00    4.61
411 2015-12-14 21:00:00    6.15
412 2015-12-14 20:00:00    6.06
413 2015-12-14 19:00:00    7.04
414 2015-12-14 18:00:00    8.57
415 2015-12-14 11:00:00    4.12
416 2015-12-14 10:00:00    3.73

We can see that observations are missing for 2015-12-14 12:00:00 to 2015-12-14 17:00:00. I would like to first locate then populate the time series with these date times and input NA for the Reading column in these positions. I would also like to return the indices that are missing in an additional vector.

How can this be done?

So far I have tried the following code (as suggested here, how to add a missing dates and remove repeated dates in hourly time series), but all I end up with is NA values when I perform the merge function and still need to identify where the missing indices are located.

Here is the result:

> grid = data.frame(date=seq.POSIXt(min(snip[,1]), to=max(snip[,1]), by="1 hours"));
> dat = merge(grid, snip, by="date", all.x=TRUE)
> dat
                  date Reading
1  2015-12-14 10:00:00      NA
2  2015-12-14 11:00:00      NA
3  2015-12-14 12:00:00      NA
4  2015-12-14 13:00:00      NA
5  2015-12-14 14:00:00      NA
6  2015-12-14 15:00:00      NA
7  2015-12-14 16:00:00      NA
8  2015-12-14 17:00:00      NA
9  2015-12-14 18:00:00      NA
10 2015-12-14 19:00:00      NA
11 2015-12-14 20:00:00      NA
12 2015-12-14 21:00:00      NA
13 2015-12-14 22:00:00      NA
14 2015-12-14 23:00:00      NA
15 2015-12-15 00:00:00      NA

What am I missing here? Is it because grid and snip$date are in reverse order? For additional information here is what the date time format looks like (in case this is from where my issue stems):

> snip[2,1]
[1] "2015-12-14 23:00:00 GMT"

The result of the dput(snip) command is as follows (thanks for the suggestion @42):

> dput(snip)
structure(list(date = structure(list(sec = c(0, 0, 0, 0, 0, 0, 
0, 0, 0), min = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), hour = c(0L, 
23L, 22L, 21L, 20L, 19L, 18L, 11L, 10L), mday = c(15L, 14L, 14L, 
14L, 14L, 14L, 14L, 14L, 14L), mon = c(11L, 11L, 11L, 11L, 11L, 
11L, 11L, 11L, 11L), year = c(115L, 115L, 115L, 115L, 115L, 115L, 
115L, 115L, 115L), wday = c(2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L
), yday = c(348L, 347L, 347L, 347L, 347L, 347L, 347L, 347L, 347L
), isdst = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)), .Names = c("sec", 
"min", "hour", "mday", "mon", "year", "wday", "yday", "isdst"
), class = c("POSIXlt", "POSIXt"), tzone = "GMT"), Reading = c(4.4, 
4.62, 4.61, 6.15, 6.06, 7.04, 8.57, 4.12, 3.73)), .Names = c("date", 
"Reading"), row.names = 408:416, class = "data.frame")

Upvotes: 0

Views: 1444

Answers (1)

R.S.
R.S.

Reputation: 2140

Here's how I was able to do it with some help from na.locf documentation. Does it help?

dat<- dget("yoursample")
require(xts)
datxts<- as.xts(dat[,-1],order.by = dat$date,frequency = 24)
tzn<-tzone(datxts)
g<- seq(start(datxts), end(datxts), "hour")
gxts<- xts(rep(NA,length(g)),order.by = as.POSIXct(g), tzone = tzn)

merge(datxts,gxts,all = T)$datxts

Edit: And also, your method works if you add a column of NA's to generated dataframe

dates=seq.POSIXt(min(snip[,1]), to=max(snip[,1]), by="1 hours")
grid = data.frame(date=dates,dummydata=rep(NA,length(dates)));
dat = merge(grid, snip, by="date", all=T)

Upvotes: 0

Related Questions