Reputation: 6954
Suppose I would like to track which rows from one data.table were merged to another data.table. is there a way to do this at once/while merging? Please see my example below and the way I usually do it. However, this seems rather inefficient.
library(data.table)
# initial data
DT = data.table(x = c(1,1,1,2,2,1,1,2,2),
y = c(1,3,6))
# data to merge
DTx <- data.table(x = 1:3,
y = 1,
k = "X")
# regular update join
copy(DT)[DTx,
on = .(x, y),
k := i.k][]
#> x y k
#> 1: 1 1 X
#> 2: 1 3 <NA>
#> 3: 1 6 <NA>
#> 4: 2 1 X
#> 5: 2 3 <NA>
#> 6: 1 6 <NA>
#> 7: 1 1 X
#> 8: 2 3 <NA>
#> 9: 2 6 <NA>
# DTx remains the same
DTx
#> x y k
#> 1: 1 1 X
#> 2: 2 1 X
#> 3: 3 1 X
# set an Id variable
DTx[, Id := .I]
# assign the Id in merge
DT[DTx,
on = .(x, y),
`:=`(k = i.k,
matched_id = i.Id)][]
#> x y k matched_id
#> 1: 1 1 X 1
#> 2: 1 3 <NA> NA
#> 3: 1 6 <NA> NA
#> 4: 2 1 X 2
#> 5: 2 3 <NA> NA
#> 6: 1 6 <NA> NA
#> 7: 1 1 X 1
#> 8: 2 3 <NA> NA
#> 9: 2 6 <NA> NA
# use matched_id to find merged rows
DTx[, matched := fifelse(Id %in% DT$matched_id, TRUE, FALSE)]
DTx
#> x y k Id matched
#> 1: 1 1 X 1 TRUE
#> 2: 2 1 X 2 TRUE
#> 3: 3 1 X 3 FALSE
Upvotes: 6
Views: 243
Reputation: 66819
Following Jan's comment:
This will provide you indices of matching rows but you will have to call merge again to perform actual merging, unless you manually use provided indices to match/update those tables.
You can pull the indices:
merge_metaDT = DT[DTx, on=.(x, y), .(irow = .GRP, xrow = .I), by=.EACHI]
x y irow xrow
1: 1 1 1 1
2: 1 1 1 7
3: 2 1 2 4
4: 3 1 3 0
Then apply edits to each table using indices rather than merging or matching a second time:
rowDT = merge_metaDT[xrow != 0L]
DT[rowDT$xrow, k := DTx[rowDT$irow, k]]
DTx[, matched := FALSE][rowDT$irow, matched := TRUE]
How it works:
x[i]
, the symbol .I
indexes rows of x
by=.EACHI
, .GRP
indexes each group, which means each row of i
here.I
which are coded as zerosOn this last point, we might expect NAs instead of zeros, as returned by DT[DTx, on=.(x, y), which=TRUE]
. I'm not sure why these differ.
Suppose I would like to track which rows from one data.table were merged to another data.table. is there a way to do this at once/while merging? [...] seems rather inefficient.
I expect this is more efficient than multiple merges or %in%
when the merge is costly enough.
It still requires multiple steps. I doubt there's any way around that, since it would be hard to come up with logic and syntax for the update that is easy to follow.
Update logic is already complex in base R, with multiple edits on a single index allowed:
> x = c(1, 2, 3)
> x[c(1, 1)] = c(4, 5)
> x
[1] 5 2 3
And there is the question of how to match and edit multiple indices at once:
> x = c(1, 1, 3)
> x[match(c(1, 3), x)] = c(4, 5)
> x
[1] 4 1 5
In data.table updates, the latter issue is handled with mult=
. In the update-two-tables use case, these questions would get much more complicated.
Upvotes: 1