Reputation: 886
dt1
has val
of interest in the correct sequence
library(data.table)
dt1 <- data.frame(id = 1,
key = c(paste0("a_",1:6),paste0("b_",1:6)),
val = c(122,128,134, rep(NA,3),c(110,112,114),rep(NA,3)),
var = c(rep("a",6),rep("b",6)))
id key val var
1 1 a_1 122 a
2 1 a_2 128 a
3 1 a_3 134 a
4 1 a_4 NA a
5 1 a_5 NA a
6 1 a_6 NA a
7 1 b_1 110 b
8 1 b_2 112 b
9 1 b_3 114 b
10 1 b_4 NA b
11 1 b_5 NA b
12 1 b_6 NA b
dt2
also has the val
of interest in the correct sequence but also some extra val
dt2 <- data.frame(id = 1,
key = c(paste0("a_",c(1,3:6)),paste0("b_",c(2,4:6))),
val = c(122,127,122,128,134,110,110,112,114),
var = c(rep("a",5),rep("b",4)))
id key val var
1 1 a_1 122 a
2 1 a_3 127 a
3 1 a_4 122 a
4 1 a_5 128 a
5 1 a_6 134 a
6 1 b_2 110 b
7 1 b_4 110 b
8 1 b_5 112 b
9 1 b_6 114 b
I want to match the sequence of values from dt2
to the sequence of values in dt1
and ignore extra values that are in dt2
.
I have tried a rolling join backwards as values of interest in dt2
are culstered towards the end of the sequence.
setDT(dt1,key = c("id","var","val"))
setDT(dt2,key = c("id","var","val"))
dt1[dt2, roll = -Inf]
id key val var i.key
1: 1 a_1 122 a a_1 # wrong
2: 1 a_1 122 a a_4
3: 1 a_2 127 a a_3 # wrong
4: 1 a_2 128 a a_5
5: 1 a_3 134 a a_6
6: 1 b_1 110 b b_2 # wrong
7: 1 b_1 110 b b_4
8: 1 b_2 112 b b_5
9: 1 b_3 114 b b_6
It looks like duplicate values from dt2
that are in dt1
but not in the sequence I am looking for are causing an issue. Also I want to keep i.key
to know the original key as it will be used for other processing. I have also tried: merge(dt1,dt2)
id key val var i.key
1 a_1 122 a a_4
1 a_2 128 a a_5
1 a_3 134 a a_6
1 b_1 110 b b_4
1 b_2 112 b b_5
1 b_3 114 b b_6
I would be grateful for some guidance
Upvotes: 1
Views: 116
Reputation: 25223
Assuming that we need to find the whole sequence from dt1
within dt2
, here is another option:
setDT(dt1)
setDT(dt2)
cols <- c("sid", "cnt")
#create sequence index and count of non-NAs
DT1 <- dt1[!is.na(val)][, (cols) := .(seq(.N), .N), .(id, var)]
#inner join to id, var, val that exists in both
DT2 <- DT1[dt2, on=.(id, var, val), nomatch=0L, .(id, var, val, sid, cnt, i.key)]
#identify rows with consecutive seq index and filter for those rows
consec <- DT2[, if(.N == cnt[1L]) .SD, .(cs=cumsum(c(0L, diff(sid)!=1L)))]
#perform join to get desired output
DT1[consec, on=.(id, var, val)][, c(cols, "cs", "i.cnt") := NULL][]
output:
id key val var i.key
1: 1 a_1 122 a A_4
2: 1 a_2 128 a A_5
3: 1 a_3 134 a A_6
4: 1 b_1 110 b B_4
5: 1 b_2 112 b B_5
6: 1 b_3 114 b B_6
7: 1 c_1 110 c C_3
8: 1 c_2 112 c C_5
9: 1 c_3 114 c C_6
data with additional group based on comment:
library(data.table)
dt1 <- data.frame(id = 1,
key = c(paste0("a_",1:6),paste0("b_",1:6),paste0("c_",1:6)),
val = c(122,128,134, rep(NA,3),c(110,112,114),rep(NA,3), c(110,112,114),rep(NA,3)),
var = c(rep("a",6),rep("b",6),rep("c",6)))
dt2 <- data.frame(id = 1,
key = c(paste0("A_",c(1,3:6)),paste0("B_",c(2,4:6)),paste0("C_",c(2,3:6))),
val = c(122,127,122,128,134,110,110,112,114,134,110,200,112,114),
var = c(rep("a",5),rep("b",4),rep("c",5)))
Upvotes: 3
Reputation: 24878
Here's an approach that first joins back onto dt2
to filter out key == key
pairs because you can't do a non-equi join and rolling join at the same time in data.table
.
Also remember that only the last column in your on =
argument can be rolling, so I recommend always explicitly defining it.
library(data.table)
dt2[dt1, on = .(id,var,val),nomatch = 0][
key != i.key][dt1,on = .(id,var,val,i.key=key), roll = -Inf, nomatch = 0][
,.SD[.N],by = .(id,val,var)][,.(id,val,var,key1 = i.key,key2 = key)]
id val var key1 key2
1: 1 122 a a_1 a_4
2: 1 128 a a_2 a_5
3: 1 134 a a_3 a_6
4: 1 110 b b_1 b_4
5: 1 112 b b_2 b_5
6: 1 114 b b_3 b_6
Upvotes: 3
Reputation: 541
There is probably nicer way to do it, but I think this will work:
merge(dt1, dt2, by = c( "val","var")) %>%
arrange(key.x) %>%
filter(key.x != key.y) %>%
mutate (id = id.x, key = key.x, i.key = key.y) %>%
select (id, key, val, var, i.key)
Result:
id key val var i.key
1 1 a_1 122 a a_4
2 1 a_2 128 a a_5
3 1 a_3 134 a a_6
4 1 b_1 110 b b_2
5 1 b_1 110 b b_4
6 1 b_2 112 b b_5
7 1 b_3 114 b b_6
Upvotes: 0