Reputation: 527
Say I have a data table:
dt <- data.table(
datetime = seq(as.POSIXct("2016-01-01 00:00:00"),as.POSIXct("2016-01-01 10:00:00"), by = "1 hour"),
ObType = c("A","A","B","B","B","B","A","A","B","A","A")
)
dt
datetime ObType
1: 2016-01-01 00:00:00 A
2: 2016-01-01 01:00:00 A
3: 2016-01-01 02:00:00 B
4: 2016-01-01 03:00:00 B
5: 2016-01-01 04:00:00 B
6: 2016-01-01 05:00:00 B
7: 2016-01-01 06:00:00 A
8: 2016-01-01 07:00:00 A
9: 2016-01-01 08:00:00 B
10: 2016-01-01 09:00:00 A
11: 2016-01-01 10:00:00 A
What I need to do is wherever the ObType is "B", I need to find the time to the nearest ObType "A" on either side. So the result should look like (in hours):
datetime ObType timeLag timeLead
1: 2016-01-01 00:00:00 A NA NA
2: 2016-01-01 01:00:00 A NA NA
3: 2016-01-01 02:00:00 B 1 4
4: 2016-01-01 03:00:00 B 2 3
5: 2016-01-01 04:00:00 B 3 2
6: 2016-01-01 05:00:00 B 4 1
7: 2016-01-01 06:00:00 A NA NA
8: 2016-01-01 07:00:00 A NA NA
9: 2016-01-01 08:00:00 B 1 1
10: 2016-01-01 09:00:00 A NA NA
11: 2016-01-01 10:00:00 A NA NA
I usually use data.table, but non data.table solutions are also fine.
Thanks!
Lyss
Upvotes: 12
Views: 538
Reputation: 32548
dt$timelag = NA
dt$timelead = NA
A = split(dt, dt$ObType)$A
B = split(dt, dt$ObType)$B
A_time_up = sort(A$datetime)
A_time_dn = sort(A$datetime, decreasing = TRUE)
B$timelag = apply(B, 1, function(x)
A_time_up[which(x[1] < A_time_up)[1]]
)
B$timelead = apply(B, 1, function(x)
A_time_dn[which(x[1] > A_time_dn)[1]]
)
B$timelag = (B$timelag - as.numeric(B$datetime))/(3600)
B$timelead = (as.numeric(B$datetime) - B$timelead)/(3600)
rbind(A,B)
Upvotes: 3
Reputation: 59602
The approach I hinted at using roll=
:
X = dt[ObType=="A"]
X
datetime ObType
1: 2016-01-01 00:00:00 A
2: 2016-01-01 01:00:00 A
3: 2016-01-01 06:00:00 A
4: 2016-01-01 07:00:00 A
5: 2016-01-01 09:00:00 A
6: 2016-01-01 10:00:00 A
dt[ObType=="B", Lag:=X[.SD,on="datetime",roll=Inf,i.datetime-x.datetime]]
dt[ObType=="B", Lead:=X[.SD,on="datetime",roll=-Inf,x.datetime-i.datetime]]
dt[ObType=="B", Nearest:=X[.SD,on="datetime",roll="nearest",x.datetime-i.datetime]]
dt
datetime ObType Lag Lead Nearest
1: 2016-01-01 00:00:00 A NA hours NA hours NA hours
2: 2016-01-01 01:00:00 A NA hours NA hours NA hours
3: 2016-01-01 02:00:00 B 1 hours 4 hours -1 hours
4: 2016-01-01 03:00:00 B 2 hours 3 hours -2 hours
5: 2016-01-01 04:00:00 B 3 hours 2 hours 2 hours
6: 2016-01-01 05:00:00 B 4 hours 1 hours 1 hours
7: 2016-01-01 06:00:00 A NA hours NA hours NA hours
8: 2016-01-01 07:00:00 A NA hours NA hours NA hours
9: 2016-01-01 08:00:00 B 1 hours 1 hours -1 hours
10: 2016-01-01 09:00:00 A NA hours NA hours NA hours
11: 2016-01-01 10:00:00 A NA hours NA hours NA hours
One advantage of roll=
is that you can apply a staleness limit just by changing the Inf
to the limit of time you wish to join within. It's the time difference that the limit applies to, not the number of rows. Inf
just means don't limit. The roll=
sign indicates whether to look forwards or backwards (lead or lag).
Another advantage is that roll=
is fast.
Upvotes: 10
Reputation: 26258
Two approaches, one using joins, the other using reshaping
There is probably a better approach that uses rolling joins / non-equi joins, but here's a brute-force approach
dt2 <- dt[, key := 1][
dt,
on = "key",
allow.cartesian = T
][
ObType != i.ObType
][
, `:=`(lag_min = datetime - i.datetime,
lag_max = i.datetime - datetime)
]
dt_min <- dt2[ObType == "B" & lag_min > 0, .(timeLag = min(lag_min)), by = .(datetime, ObType)]
dt_max <- dt2[ObType == "B" & lag_max > 0, .(timeLead = min(lag_max)), by = .(datetime, ObType)]
dt_max[ dt_min[ dt, on = c("datetime", "ObType"), nomatch = NA], on = c("datetime", "ObType"), nomatch = NA]
# datetime ObType lag_max lag_min key
# 1: 2016-01-01 00:00:00 A NA hours NA hours 1
# 2: 2016-01-01 01:00:00 A NA hours NA hours 1
# 3: 2016-01-01 02:00:00 B 4 hours 1 hours 1
# 4: 2016-01-01 03:00:00 B 3 hours 2 hours 1
# 5: 2016-01-01 04:00:00 B 2 hours 3 hours 1
# 6: 2016-01-01 05:00:00 B 1 hours 4 hours 1
# 7: 2016-01-01 06:00:00 A NA hours NA hours 1
# 8: 2016-01-01 07:00:00 A NA hours NA hours 1
# 9: 2016-01-01 08:00:00 B 1 hours 1 hours 1
# 10: 2016-01-01 09:00:00 A NA hours NA hours 1
# 11: 2016-01-01 10:00:00 A NA hours NA hours 1
It's quite involved, and some of the steps can obviously be simplified, but I'm throwing it all in here anyway so you can see the process
dt[, group := rleid(ObType)]
dt_cast <- dcast(dt, formula = datetime + group ~ ObType, value.var = "ObType")
dt_cast[, `:=`(group_before = group - 1,
group_after = group + 1)]
dt_min <- dt_cast[ !is.na(B) ][dt_cast[!is.na(A), .(datetime, group)] , on = c(group_before = "group") , allow.cartesian = T][, max(i.datetime), by = group]
dt_max <- dt_cast[ !is.na(B) ][dt_cast[!is.na(A), .(datetime, group)] , on = c(group_after = "group") , allow.cartesian = T][, min(i.datetime), by = group]
dt_cast <- rbindlist(list(
dt_cast[ dt_min, on = c("group"), nomatch = 0],
dt_cast[ dt_max, on = c("group"), nomatch = 0]
))
dt <- dt_cast[ dt, on = c("datetime", "group"), nomatch = NA][, .(datetime, ObType, lag = V1)]
dt[ObType == "B" , lag_type := c("lag", "lead"), by = .(datetime, ObType)]
dt <- dcast(dt, formula = datetime + ObType ~ lag_type, value.var = "lag")
dt[, `:=`(timeLag = difftime(datetime, lag),
timeLead = difftime(lead, datetime),
`NA` = NULL)]
dt
# datetime ObType lag lead timeLag timeLead
# 1: 2016-01-01 00:00:00 A <NA> <NA> NA hours NA hours
# 2: 2016-01-01 01:00:00 A <NA> <NA> NA hours NA hours
# 3: 2016-01-01 02:00:00 B 2016-01-01 01:00:00 2016-01-01 06:00:00 1 hours 4 hours
# 4: 2016-01-01 03:00:00 B 2016-01-01 01:00:00 2016-01-01 06:00:00 2 hours 3 hours
# 5: 2016-01-01 04:00:00 B 2016-01-01 01:00:00 2016-01-01 06:00:00 3 hours 2 hours
# 6: 2016-01-01 05:00:00 B 2016-01-01 01:00:00 2016-01-01 06:00:00 4 hours 1 hours
# 7: 2016-01-01 06:00:00 A <NA> <NA> NA hours NA hours
# 8: 2016-01-01 07:00:00 A <NA> <NA> NA hours NA hours
# 9: 2016-01-01 08:00:00 B 2016-01-01 07:00:00 2016-01-01 09:00:00 1 hours 1 hours
# 10: 2016-01-01 09:00:00 A <NA> <NA> NA hours NA hours
# 11: 2016-01-01 10:00:00 A <NA> <NA> NA hours NA hours
Upvotes: 3