e.matt
e.matt

Reputation: 886

rolling join two data.tables

dt1 has val of interest in the correct sequence

library(data.table)
dt1 <- data.frame(id = 1,
                  key = c(paste0("a_",1:6),paste0("b_",1:6)), 
                  val = c(122,128,134, rep(NA,3),c(110,112,114),rep(NA,3)),
                  var = c(rep("a",6),rep("b",6)))

   id key val var
1   1 a_1 122   a
2   1 a_2 128   a
3   1 a_3 134   a
4   1 a_4  NA   a
5   1 a_5  NA   a
6   1 a_6  NA   a
7   1 b_1 110   b
8   1 b_2 112   b
9   1 b_3 114   b
10  1 b_4  NA   b
11  1 b_5  NA   b
12  1 b_6  NA   b

dt2 also has the val of interest in the correct sequence but also some extra val

dt2 <- data.frame(id = 1,
                  key = c(paste0("a_",c(1,3:6)),paste0("b_",c(2,4:6))),
                  val = c(122,127,122,128,134,110,110,112,114),
                  var = c(rep("a",5),rep("b",4)))

  id key val var
1  1 a_1 122   a
2  1 a_3 127   a
3  1 a_4 122   a
4  1 a_5 128   a
5  1 a_6 134   a
6  1 b_2 110   b
7  1 b_4 110   b
8  1 b_5 112   b
9  1 b_6 114   b

I want to match the sequence of values from dt2 to the sequence of values in dt1 and ignore extra values that are in dt2. I have tried a rolling join backwards as values of interest in dt2 are culstered towards the end of the sequence.

Current attempt:

setDT(dt1,key = c("id","var","val"))
setDT(dt2,key = c("id","var","val"))
dt1[dt2, roll = -Inf]
   id key val var i.key
1:  1 a_1 122   a   a_1 # wrong
2:  1 a_1 122   a   a_4 
3:  1 a_2 127   a   a_3 # wrong
4:  1 a_2 128   a   a_5 
5:  1 a_3 134   a   a_6 
6:  1 b_1 110   b   b_2 # wrong
7:  1 b_1 110   b   b_4 
8:  1 b_2 112   b   b_5
9:  1 b_3 114   b   b_6

It looks like duplicate values from dt2 that are in dt1 but not in the sequence I am looking for are causing an issue. Also I want to keep i.key to know the original key as it will be used for other processing. I have also tried: merge(dt1,dt2)

Desired output:

id key val var i.key
 1 a_1 122  a  a_4
 1 a_2 128  a  a_5
 1 a_3 134  a  a_6
 1 b_1 110  b  b_4
 1 b_2 112  b  b_5
 1 b_3 114  b  b_6

I would be grateful for some guidance

Upvotes: 1

Views: 116

Answers (3)

chinsoon12
chinsoon12

Reputation: 25223

Assuming that we need to find the whole sequence from dt1 within dt2, here is another option:

setDT(dt1)
setDT(dt2)
cols <- c("sid", "cnt")

#create sequence index and count of non-NAs
DT1 <- dt1[!is.na(val)][, (cols) := .(seq(.N), .N), .(id, var)]

#inner join to id, var, val that exists in both
DT2 <- DT1[dt2, on=.(id, var, val), nomatch=0L, .(id, var, val, sid, cnt, i.key)]

#identify rows with consecutive seq index and filter for those rows
consec <- DT2[, if(.N == cnt[1L]) .SD, .(cs=cumsum(c(0L, diff(sid)!=1L)))]

#perform join to get desired output
DT1[consec, on=.(id, var, val)][, c(cols, "cs", "i.cnt") := NULL][]

output:

   id key val var i.key
1:  1 a_1 122   a   A_4
2:  1 a_2 128   a   A_5
3:  1 a_3 134   a   A_6
4:  1 b_1 110   b   B_4
5:  1 b_2 112   b   B_5
6:  1 b_3 114   b   B_6
7:  1 c_1 110   c   C_3
8:  1 c_2 112   c   C_5
9:  1 c_3 114   c   C_6

data with additional group based on comment:

library(data.table)
dt1 <- data.frame(id = 1,
  key = c(paste0("a_",1:6),paste0("b_",1:6),paste0("c_",1:6)),
  val = c(122,128,134, rep(NA,3),c(110,112,114),rep(NA,3), c(110,112,114),rep(NA,3)),
  var = c(rep("a",6),rep("b",6),rep("c",6)))
dt2 <- data.frame(id = 1,
  key = c(paste0("A_",c(1,3:6)),paste0("B_",c(2,4:6)),paste0("C_",c(2,3:6))),
  val = c(122,127,122,128,134,110,110,112,114,134,110,200,112,114),
  var = c(rep("a",5),rep("b",4),rep("c",5)))

Upvotes: 3

Ian Campbell
Ian Campbell

Reputation: 24878

Here's an approach that first joins back onto dt2 to filter out key == key pairs because you can't do a non-equi join and rolling join at the same time in data.table.

Also remember that only the last column in your on = argument can be rolling, so I recommend always explicitly defining it.

library(data.table)
dt2[dt1, on = .(id,var,val),nomatch = 0][
  key != i.key][dt1,on = .(id,var,val,i.key=key), roll = -Inf, nomatch = 0][
    ,.SD[.N],by = .(id,val,var)][,.(id,val,var,key1 = i.key,key2 = key)]
   id val var key1 key2
1:  1 122   a  a_1  a_4
2:  1 128   a  a_2  a_5
3:  1 134   a  a_3  a_6
4:  1 110   b  b_1  b_4
5:  1 112   b  b_2  b_5
6:  1 114   b  b_3  b_6

Upvotes: 3

Shan R
Shan R

Reputation: 541

There is probably nicer way to do it, but I think this will work:

merge(dt1, dt2, by = c( "val","var")) %>%
  arrange(key.x) %>%
  filter(key.x != key.y) %>%
  mutate (id = id.x, key = key.x, i.key = key.y) %>%
  select (id, key, val, var, i.key)

Result:

  id key val var i.key
1  1 a_1 122   a   a_4
2  1 a_2 128   a   a_5
3  1 a_3 134   a   a_6
4  1 b_1 110   b   b_2
5  1 b_1 110   b   b_4
6  1 b_2 112   b   b_5
7  1 b_3 114   b   b_6

Upvotes: 0

Related Questions