Reputation: 3005
I have a data.table
containing time series of hourly observations from different locations (sites). There are gaps -- missing hours -- in each sequence. I want to fill out the sequence of hours for each site, so each sequence has a row for each hour (although data will be missing, NA).
Example data:
library(data.table)
library(lubridate)
DT <- data.table(site = rep(LETTERS[1:2], each = 3),
date = ymd_h(c("2017080101", "2017080103", "2017080105",
"2017080103", "2017080105", "2017080107")),
# x = c(1.1, 1.2, 1.3, 2.1, 2.2, 2.3, 3.1, 3.2, 3.3),
x = c(1.1, 1.2, 1.3, 2.1, 2.2, 2.3),
key = c("site", "date"))
DT
# site date x
# 1: A 2017-08-01 01:00:00 1.1
# 2: A 2017-08-01 03:00:00 1.2
# 3: A 2017-08-01 05:00:00 1.3
# 4: B 2017-08-01 03:00:00 2.1
# 5: B 2017-08-01 05:00:00 2.2
# 6: B 2017-08-01 07:00:00 2.3
The desired result DT2
would contain all the hours between the first (minimum) date and the last (maximum) date for each site, with x missing where the new rows are inserted:
# site date x
# 1: A 2017-08-01 01:00:00 1.1
# 2: A 2017-08-01 02:00:00 NA
# 3: A 2017-08-01 03:00:00 1.2
# 4: A 2017-08-01 04:00:00 NA
# 5: A 2017-08-01 05:00:00 1.3
# 6: B 2017-08-01 03:00:00 2.1
# 7: B 2017-08-01 04:00:00 NA
# 8: B 2017-08-01 05:00:00 2.2
# 9: B 2017-08-01 06:00:00 NA
#10: B 2017-08-01 07:00:00 2.3
I have tried to join DT
with a date sequence constructed from min(date)
and max(date)
. This is in the right direction, but the date range is over all sites rather than for each individual site, the filled in rows have missing site, and the sort order (key) is wrong:
DT[.(seq(from = min(date), to = max(date), by = "hour")),
.SD, on="date"]
# site date x
# 1: A 2017-08-01 01:00:00 1.1
# 2: NA 2017-08-01 02:00:00 NA
# 3: A 2017-08-01 03:00:00 1.2
# 4: B 2017-08-01 03:00:00 2.1
# 5: NA 2017-08-01 04:00:00 NA
# 6: A 2017-08-01 05:00:00 1.3
# 7: B 2017-08-01 05:00:00 2.2
# 8: NA 2017-08-01 06:00:00 NA
# 9: B 2017-08-01 07:00:00 2.3
So I naturally tried adding by = site
:
DT[.(seq(from = min(date), to = max(date), by = "hour")),
.SD, on="date", by=.(site)]
# site date x
# 1: A 2017-08-01 01:00:00 1.1
# 2: A 2017-08-01 03:00:00 1.2
# 3: A 2017-08-01 05:00:00 1.3
# 4: NA <NA> NA
# 5: B 2017-08-01 03:00:00 2.1
# 6: B 2017-08-01 05:00:00 2.2
# 7: B 2017-08-01 07:00:00 2.3
But this doesn't work either. Can anyone suggest the right data.table
formulation to give the desired filled-out DT2
shown above?
Upvotes: 1
Views: 134
Reputation: 3005
Thanks to both Frank and Wen for putting me on the right track. I found a compact data.table
solution. The result DT2
is also keyed on site and date, as in the input table (which is desirable although I didn't request this in the OP). This is a reformulation of Wen's solution, in data.table
syntax, which I assume will be slightly more efficient on large datasets.
DT2 <- DT[setkey(DT[, .(date = seq(from = min(date), to = max(date),
by = "hour")), by = site], site, date), ]
DT2
# site date x
# 1: A 2017-08-01 01:00:00 1.1
# 2: A 2017-08-01 02:00:00 NA
# 3: A 2017-08-01 03:00:00 1.2
# 4: A 2017-08-01 04:00:00 NA
# 5: A 2017-08-01 05:00:00 1.3
# 6: B 2017-08-01 03:00:00 2.1
# 7: B 2017-08-01 04:00:00 NA
# 8: B 2017-08-01 05:00:00 2.2
# 9: B 2017-08-01 06:00:00 NA
#10: B 2017-08-01 07:00:00 2.3
key(DT2)
# [1] "site" "date"
EDIT1: As mentioned by Frank, the on=
syntax can also be used. The following DT3
formulation gives the correct answer, but DT3
is not keyed, while the DT2
result is keyed. That means an 'extra' setkey()
would be needed if a keyed result was desired.
DT3 <- DT[DT[, .(date = seq(from = min(date), to = max(date),
by = "hour")), by = site], on = c("site", "date"), ]
DT3
# site date x
# 1: A 2017-08-01 01:00:00 1.1
# 2: A 2017-08-01 02:00:00 NA
# 3: A 2017-08-01 03:00:00 1.2
# 4: A 2017-08-01 04:00:00 NA
# 5: A 2017-08-01 05:00:00 1.3
# 6: B 2017-08-01 03:00:00 2.1
# 7: B 2017-08-01 04:00:00 NA
# 8: B 2017-08-01 05:00:00 2.2
# 9: B 2017-08-01 06:00:00 NA
#10: B 2017-08-01 07:00:00 2.3
key(DT3)
# NULL
all.equal(DT2, DT3)
# [1] "Datasets has different keys. 'target': site, date. 'current' has no key."
all.equal(DT2, DT3, check.attributes = FALSE)
# [1] TRUE
Is there a way to write the DT3
expression to give a keyed result, other than expressly using setkey()
?
EDIT2: Frank's comment suggests an additional formulation DT4
using keyby = .EACHI
. In this case .SD
is inserted as j
, which is required when by
or keyby
is used. This gives the correct answer and the result is keyed like the DT2
formulation.
DT4 <- DT[DT[, .(date = seq(from = min(date), to = max(date), by = "hour")),
by = site], on = c("site", "date"), .SD, keyby = .EACHI]
DT4
# site date x
# 1: A 2017-08-01 01:00:00 1.1
# 2: A 2017-08-01 02:00:00 NA
# 3: A 2017-08-01 03:00:00 1.2
# 4: A 2017-08-01 04:00:00 NA
# 5: A 2017-08-01 05:00:00 1.3
# 6: B 2017-08-01 03:00:00 2.1
# 7: B 2017-08-01 04:00:00 NA
# 8: B 2017-08-01 05:00:00 2.2
# 9: B 2017-08-01 06:00:00 NA
#10: B 2017-08-01 07:00:00 2.3
key(DT4)
# [1] "site" "date"
identical(DT2, DT4)
# [1] TRUE
Upvotes: 2
Reputation: 323276
library(data.table)
library(lubridate)
setDT(DT)
test <- DT[, .(date = seq(min(date), max(date), by = 'hour')), by =
'site']
DT <- merge(test, DT, by = c('site', 'date'), all.x = TRUE)
DT
site date x
1: A 2017-08-01 01:00:00 1.1
2: A 2017-08-01 02:00:00 NA
3: A 2017-08-01 03:00:00 1.2
4: A 2017-08-01 04:00:00 NA
5: A 2017-08-01 05:00:00 1.3
6: B 2017-08-01 03:00:00 2.1
7: B 2017-08-01 04:00:00 NA
8: B 2017-08-01 05:00:00 2.2
9: B 2017-08-01 06:00:00 NA
10: B 2017-08-01 07:00:00 2.3
Upvotes: 3