Amitai
Amitai

Reputation: 891

Data.Table rolling join by group

How can I find the last value, prior to test.day, for each (loc.x, loc.y) pair?

dt <- data.table( 
  loc.x = as.integer(c(1, 1, 3, 1, 3, 1)),
  loc.y = as.integer(c(1, 2, 1, 2, 1, 2)),
  time = as.IDate(c("2015-03-11", "2015-05-10", "2015-09-27",
                    "2015-11-25", "2014-09-13", "2015-08-19")), 
  value = letters[1:6]
)

setkey(dt, loc.x, loc.y, time)
test.day <- as.IDate("2015-10-01")

Required output:

   loc.x loc.y value
1:     1     1     a
2:     1     2     f
3:     3     1     c

Upvotes: 8

Views: 1473

Answers (3)

David Arenburg
David Arenburg

Reputation: 92300

Here's another option using a rolling join after creating a lookup table

indx <- data.table(unique(dt[ ,.(loc.x, loc.y)]), time = test.day)  
dt[indx, roll = TRUE, on = names(indx)]
#    loc.x loc.y       time value
# 1:     1     1 2015-10-01     a
# 2:     1     2 2015-10-01     f
# 3:     3     1 2015-10-01     c

Or a very similar option suggested by @eddi

dt[dt[, .(time = test.day), by = .(loc.x, loc.y)], roll = T, on = c('loc.x', 'loc.y', 'time')]

Or a one liner which will be less efficient as it will call [.data.table by group

dt[, 
    .SD[data.table(test.day), value, roll = TRUE, on = c(time = "test.day")], 
    by = .(loc.x, loc.y)
  ]
#    loc.x loc.y V1
# 1:     1     1  a
# 2:     1     2  f
# 3:     3     1  c

Upvotes: 6

talat
talat

Reputation: 70336

You can first subset the rows where time < test.day (which should be quite efficient because it is not done by group) and then select the last value per group. To do that you can either use tail(value, 1L) or, as suggested by Floo0, value[.N], resulting in:

dt[time < test.day, tail(value, 1L), by = .(loc.x, loc.y)]
#   loc.x loc.y V1
#1:     1     1  a
#2:     1     2  f
#3:     3     1  c

or

dt[time < test.day, value[.N], by = .(loc.x, loc.y)]

Note that this works because the data is sorted due to setkey(dt, loc.x, loc.y, time).

Upvotes: 6

Jaap
Jaap

Reputation: 83275

Another option is to use the last function:

dt[, last(value[time < test.day]), by = .(loc.x, loc.y)]

which gives:

   loc.x loc.y V1
1:     1     1  a
2:     1     2  f
3:     3     1  c

Upvotes: 6

Related Questions