Douglas Clark
Douglas Clark

Reputation: 3005

How to make continuous time sequences within groups in data.table?

I have a data.table containing time series of hourly observations from different locations (sites). There are gaps -- missing hours -- in each sequence. I want to fill out the sequence of hours for each site, so each sequence has a row for each hour (although data will be missing, NA).

Example data:

library(data.table)
library(lubridate)

DT <- data.table(site = rep(LETTERS[1:2], each = 3),
                 date = ymd_h(c("2017080101", "2017080103", "2017080105",
                                "2017080103", "2017080105", "2017080107")),
                 # x = c(1.1, 1.2, 1.3, 2.1, 2.2, 2.3, 3.1, 3.2, 3.3), 
                 x = c(1.1, 1.2, 1.3, 2.1, 2.2, 2.3), 
                 key = c("site", "date"))
DT
#    site                date   x
# 1:    A 2017-08-01 01:00:00 1.1
# 2:    A 2017-08-01 03:00:00 1.2
# 3:    A 2017-08-01 05:00:00 1.3
# 4:    B 2017-08-01 03:00:00 2.1
# 5:    B 2017-08-01 05:00:00 2.2
# 6:    B 2017-08-01 07:00:00 2.3

The desired result DT2 would contain all the hours between the first (minimum) date and the last (maximum) date for each site, with x missing where the new rows are inserted:

#    site                date   x
# 1:    A 2017-08-01 01:00:00 1.1
# 2:    A 2017-08-01 02:00:00  NA
# 3:    A 2017-08-01 03:00:00 1.2
# 4:    A 2017-08-01 04:00:00  NA
# 5:    A 2017-08-01 05:00:00 1.3
# 6:    B 2017-08-01 03:00:00 2.1
# 7:    B 2017-08-01 04:00:00  NA
# 8:    B 2017-08-01 05:00:00 2.2
# 9:    B 2017-08-01 06:00:00  NA
#10:    B 2017-08-01 07:00:00 2.3

I have tried to join DT with a date sequence constructed from min(date) and max(date). This is in the right direction, but the date range is over all sites rather than for each individual site, the filled in rows have missing site, and the sort order (key) is wrong:

DT[.(seq(from = min(date), to = max(date), by = "hour")),
    .SD, on="date"]
#    site                date   x
# 1:    A 2017-08-01 01:00:00 1.1
# 2:   NA 2017-08-01 02:00:00  NA
# 3:    A 2017-08-01 03:00:00 1.2
# 4:    B 2017-08-01 03:00:00 2.1
# 5:   NA 2017-08-01 04:00:00  NA
# 6:    A 2017-08-01 05:00:00 1.3
# 7:    B 2017-08-01 05:00:00 2.2
# 8:   NA 2017-08-01 06:00:00  NA
# 9:    B 2017-08-01 07:00:00 2.3

So I naturally tried adding by = site:

DT[.(seq(from = min(date), to = max(date), by = "hour")),
   .SD, on="date", by=.(site)]
#    site                date   x
# 1:    A 2017-08-01 01:00:00 1.1
# 2:    A 2017-08-01 03:00:00 1.2
# 3:    A 2017-08-01 05:00:00 1.3
# 4:   NA                <NA>  NA
# 5:    B 2017-08-01 03:00:00 2.1
# 6:    B 2017-08-01 05:00:00 2.2
# 7:    B 2017-08-01 07:00:00 2.3

But this doesn't work either. Can anyone suggest the right data.table formulation to give the desired filled-out DT2 shown above?

Upvotes: 1

Views: 134

Answers (2)

Douglas Clark
Douglas Clark

Reputation: 3005

Thanks to both Frank and Wen for putting me on the right track. I found a compact data.table solution. The result DT2 is also keyed on site and date, as in the input table (which is desirable although I didn't request this in the OP). This is a reformulation of Wen's solution, in data.table syntax, which I assume will be slightly more efficient on large datasets.

DT2 <- DT[setkey(DT[, .(date = seq(from = min(date), to = max(date), 
                         by = "hour")), by = site], site, date), ]
DT2
#    site                date   x
# 1:    A 2017-08-01 01:00:00 1.1
# 2:    A 2017-08-01 02:00:00  NA
# 3:    A 2017-08-01 03:00:00 1.2
# 4:    A 2017-08-01 04:00:00  NA
# 5:    A 2017-08-01 05:00:00 1.3
# 6:    B 2017-08-01 03:00:00 2.1
# 7:    B 2017-08-01 04:00:00  NA
# 8:    B 2017-08-01 05:00:00 2.2
# 9:    B 2017-08-01 06:00:00  NA
#10:    B 2017-08-01 07:00:00 2.3
key(DT2)
# [1] "site" "date"

EDIT1: As mentioned by Frank, the on= syntax can also be used. The following DT3 formulation gives the correct answer, but DT3 is not keyed, while the DT2 result is keyed. That means an 'extra' setkey() would be needed if a keyed result was desired.

DT3 <- DT[DT[, .(date = seq(from = min(date), to = max(date), 
                  by = "hour")), by = site], on = c("site", "date"), ]
DT3
#    site                date   x
# 1:    A 2017-08-01 01:00:00 1.1
# 2:    A 2017-08-01 02:00:00  NA
# 3:    A 2017-08-01 03:00:00 1.2
# 4:    A 2017-08-01 04:00:00  NA
# 5:    A 2017-08-01 05:00:00 1.3
# 6:    B 2017-08-01 03:00:00 2.1
# 7:    B 2017-08-01 04:00:00  NA
# 8:    B 2017-08-01 05:00:00 2.2
# 9:    B 2017-08-01 06:00:00  NA
#10:    B 2017-08-01 07:00:00 2.3
key(DT3)
# NULL
all.equal(DT2, DT3)
# [1] "Datasets has different keys. 'target': site, date. 'current' has no key."
all.equal(DT2, DT3, check.attributes = FALSE)
# [1] TRUE

Is there a way to write the DT3 expression to give a keyed result, other than expressly using setkey()?

EDIT2: Frank's comment suggests an additional formulation DT4 using keyby = .EACHI. In this case .SD is inserted as j, which is required when by or keyby is used. This gives the correct answer and the result is keyed like the DT2 formulation.

DT4 <- DT[DT[, .(date = seq(from = min(date), to = max(date), by = "hour")), 
             by = site], on = c("site", "date"), .SD, keyby = .EACHI]
DT4
#    site                date   x
# 1:    A 2017-08-01 01:00:00 1.1
# 2:    A 2017-08-01 02:00:00  NA
# 3:    A 2017-08-01 03:00:00 1.2
# 4:    A 2017-08-01 04:00:00  NA
# 5:    A 2017-08-01 05:00:00 1.3
# 6:    B 2017-08-01 03:00:00 2.1
# 7:    B 2017-08-01 04:00:00  NA
# 8:    B 2017-08-01 05:00:00 2.2
# 9:    B 2017-08-01 06:00:00  NA
#10:    B 2017-08-01 07:00:00 2.3
key(DT4)
# [1] "site" "date"
identical(DT2, DT4)
# [1] TRUE

Upvotes: 2

BENY
BENY

Reputation: 323276

library(data.table)
library(lubridate)  
setDT(DT)
test <- DT[, .(date = seq(min(date), max(date), by = 'hour')), by = 
              'site']
DT <- merge(test, DT, by = c('site', 'date'), all.x = TRUE)


DT
    site                date   x
 1:    A 2017-08-01 01:00:00 1.1
 2:    A 2017-08-01 02:00:00  NA
 3:    A 2017-08-01 03:00:00 1.2
 4:    A 2017-08-01 04:00:00  NA
 5:    A 2017-08-01 05:00:00 1.3
 6:    B 2017-08-01 03:00:00 2.1
 7:    B 2017-08-01 04:00:00  NA
 8:    B 2017-08-01 05:00:00 2.2
 9:    B 2017-08-01 06:00:00  NA
10:    B 2017-08-01 07:00:00 2.3

Upvotes: 3

Related Questions