Bucket
Bucket

Reputation: 527

Find time to nearest occurrence of particular value for each row

Say I have a data table:

dt <- data.table(
        datetime = seq(as.POSIXct("2016-01-01 00:00:00"),as.POSIXct("2016-01-01 10:00:00"), by = "1 hour"),
        ObType = c("A","A","B","B","B","B","A","A","B","A","A")
)

dt
                   datetime ObType
     1: 2016-01-01 00:00:00      A
     2: 2016-01-01 01:00:00      A
     3: 2016-01-01 02:00:00      B
     4: 2016-01-01 03:00:00      B
     5: 2016-01-01 04:00:00      B
     6: 2016-01-01 05:00:00      B
     7: 2016-01-01 06:00:00      A
     8: 2016-01-01 07:00:00      A
     9: 2016-01-01 08:00:00      B
    10: 2016-01-01 09:00:00      A
    11: 2016-01-01 10:00:00      A

What I need to do is wherever the ObType is "B", I need to find the time to the nearest ObType "A" on either side. So the result should look like (in hours):

               datetime ObType timeLag timeLead
 1: 2016-01-01 00:00:00      A      NA       NA
 2: 2016-01-01 01:00:00      A      NA       NA
 3: 2016-01-01 02:00:00      B       1        4
 4: 2016-01-01 03:00:00      B       2        3
 5: 2016-01-01 04:00:00      B       3        2
 6: 2016-01-01 05:00:00      B       4        1
 7: 2016-01-01 06:00:00      A      NA       NA
 8: 2016-01-01 07:00:00      A      NA       NA
 9: 2016-01-01 08:00:00      B       1        1
10: 2016-01-01 09:00:00      A      NA       NA
11: 2016-01-01 10:00:00      A      NA       NA

I usually use data.table, but non data.table solutions are also fine.

Thanks!

Lyss

Upvotes: 12

Views: 538

Answers (3)

d.b
d.b

Reputation: 32548

dt$timelag = NA
dt$timelead = NA

A = split(dt, dt$ObType)$A
B = split(dt, dt$ObType)$B

A_time_up = sort(A$datetime)
A_time_dn = sort(A$datetime, decreasing = TRUE)

B$timelag = apply(B, 1, function(x) 
    A_time_up[which(x[1] < A_time_up)[1]]
)

B$timelead = apply(B, 1, function(x) 
    A_time_dn[which(x[1] > A_time_dn)[1]]
)

B$timelag = (B$timelag - as.numeric(B$datetime))/(3600)
B$timelead = (as.numeric(B$datetime) - B$timelead)/(3600)

rbind(A,B)

Upvotes: 3

Matt Dowle
Matt Dowle

Reputation: 59602

The approach I hinted at using roll= :

X = dt[ObType=="A"]
X
              datetime ObType
1: 2016-01-01 00:00:00      A
2: 2016-01-01 01:00:00      A
3: 2016-01-01 06:00:00      A
4: 2016-01-01 07:00:00      A
5: 2016-01-01 09:00:00      A
6: 2016-01-01 10:00:00      A

dt[ObType=="B", Lag:=X[.SD,on="datetime",roll=Inf,i.datetime-x.datetime]]
dt[ObType=="B", Lead:=X[.SD,on="datetime",roll=-Inf,x.datetime-i.datetime]]
dt[ObType=="B", Nearest:=X[.SD,on="datetime",roll="nearest",x.datetime-i.datetime]]
dt
               datetime ObType      Lag     Lead     Nearest
 1: 2016-01-01 00:00:00      A NA hours NA hours    NA hours
 2: 2016-01-01 01:00:00      A NA hours NA hours    NA hours
 3: 2016-01-01 02:00:00      B  1 hours  4 hours    -1 hours
 4: 2016-01-01 03:00:00      B  2 hours  3 hours    -2 hours
 5: 2016-01-01 04:00:00      B  3 hours  2 hours     2 hours
 6: 2016-01-01 05:00:00      B  4 hours  1 hours     1 hours
 7: 2016-01-01 06:00:00      A NA hours NA hours    NA hours
 8: 2016-01-01 07:00:00      A NA hours NA hours    NA hours
 9: 2016-01-01 08:00:00      B  1 hours  1 hours    -1 hours
10: 2016-01-01 09:00:00      A NA hours NA hours    NA hours
11: 2016-01-01 10:00:00      A NA hours NA hours    NA hours

One advantage of roll= is that you can apply a staleness limit just by changing the Inf to the limit of time you wish to join within. It's the time difference that the limit applies to, not the number of rows. Inf just means don't limit. The roll= sign indicates whether to look forwards or backwards (lead or lag).

Another advantage is that roll= is fast.

Upvotes: 10

SymbolixAU
SymbolixAU

Reputation: 26258

Two approaches, one using joins, the other using reshaping

Joins

There is probably a better approach that uses rolling joins / non-equi joins, but here's a brute-force approach

dt2 <- dt[, key := 1][ 
    dt, 
    on = "key", 
    allow.cartesian = T
    ][
        ObType != i.ObType
        ][
            , `:=`(lag_min = datetime - i.datetime,
                         lag_max = i.datetime - datetime)
            ]


dt_min <- dt2[ObType == "B" & lag_min > 0, .(timeLag = min(lag_min)), by = .(datetime, ObType)]
dt_max <- dt2[ObType == "B" & lag_max > 0, .(timeLead = min(lag_max)), by = .(datetime, ObType)]


dt_max[ dt_min[ dt, on = c("datetime", "ObType"), nomatch = NA], on = c("datetime", "ObType"), nomatch = NA]

#                datetime ObType  lag_max  lag_min key
#  1: 2016-01-01 00:00:00      A NA hours NA hours   1
#  2: 2016-01-01 01:00:00      A NA hours NA hours   1
#  3: 2016-01-01 02:00:00      B  4 hours  1 hours   1
#  4: 2016-01-01 03:00:00      B  3 hours  2 hours   1
#  5: 2016-01-01 04:00:00      B  2 hours  3 hours   1
#  6: 2016-01-01 05:00:00      B  1 hours  4 hours   1
#  7: 2016-01-01 06:00:00      A NA hours NA hours   1
#  8: 2016-01-01 07:00:00      A NA hours NA hours   1
#  9: 2016-01-01 08:00:00      B  1 hours  1 hours   1
# 10: 2016-01-01 09:00:00      A NA hours NA hours   1
# 11: 2016-01-01 10:00:00      A NA hours NA hours   1

Reshaping

It's quite involved, and some of the steps can obviously be simplified, but I'm throwing it all in here anyway so you can see the process

dt[, group := rleid(ObType)]
dt_cast <- dcast(dt, formula = datetime + group ~ ObType, value.var = "ObType")

dt_cast[, `:=`(group_before = group - 1,
                             group_after = group + 1)]


dt_min <- dt_cast[ !is.na(B) ][dt_cast[!is.na(A), .(datetime, group)] , on = c(group_before = "group")  , allow.cartesian = T][, max(i.datetime), by = group]
dt_max <- dt_cast[ !is.na(B) ][dt_cast[!is.na(A), .(datetime, group)] , on = c(group_after = "group")  , allow.cartesian = T][, min(i.datetime), by = group]


dt_cast <- rbindlist(list(
    dt_cast[ dt_min, on = c("group"), nomatch = 0],
    dt_cast[ dt_max, on = c("group"), nomatch = 0]
))

dt <- dt_cast[ dt, on = c("datetime", "group"), nomatch = NA][, .(datetime, ObType, lag = V1)]

dt[ObType == "B" , lag_type := c("lag", "lead"), by = .(datetime, ObType)]
dt <- dcast(dt, formula = datetime + ObType ~ lag_type, value.var = "lag")

dt[, `:=`(timeLag = difftime(datetime, lag),
                    timeLead = difftime(lead, datetime),
                    `NA` = NULL)]

dt
#                datetime ObType                 lag                lead  timeLag timeLead
#  1: 2016-01-01 00:00:00      A                <NA>                <NA> NA hours NA hours
#  2: 2016-01-01 01:00:00      A                <NA>                <NA> NA hours NA hours
#  3: 2016-01-01 02:00:00      B 2016-01-01 01:00:00 2016-01-01 06:00:00  1 hours  4 hours
#  4: 2016-01-01 03:00:00      B 2016-01-01 01:00:00 2016-01-01 06:00:00  2 hours  3 hours
#  5: 2016-01-01 04:00:00      B 2016-01-01 01:00:00 2016-01-01 06:00:00  3 hours  2 hours
#  6: 2016-01-01 05:00:00      B 2016-01-01 01:00:00 2016-01-01 06:00:00  4 hours  1 hours
#  7: 2016-01-01 06:00:00      A                <NA>                <NA> NA hours NA hours
#  8: 2016-01-01 07:00:00      A                <NA>                <NA> NA hours NA hours
#  9: 2016-01-01 08:00:00      B 2016-01-01 07:00:00 2016-01-01 09:00:00  1 hours  1 hours
# 10: 2016-01-01 09:00:00      A                <NA>                <NA> NA hours NA hours
# 11: 2016-01-01 10:00:00      A                <NA>                <NA> NA hours NA hours

Upvotes: 3

Related Questions