Reputation: 103898
How do I perform a semi-join with data.table? A semi-join is like an inner join except that it only returns the columns of X (not also those of Y), and does not repeat the rows of X to match the rows of Y. For example, the following code performs an inner join:
x <- data.table(x = 1:2, y = c("a", "b"))
setkey(x, x)
y <- data.table(x = c(1, 1), z = 10:11)
x[y]
# x y z
# 1: 1 a 10
# 2: 1 a 11
A semi-join would return just x[1].
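That is, the expected output is just:
x[1]
# x y
# 1: 1 a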
Upvotes: 44
Views: 8030
Reputation: 66819
I tried to write a method that doesn't use any names, which are downright confusing in the OP's example.
sJ <- function(x, y){
  # use only as many leading columns of y as there are key columns in x
  ycols <- 1:min(ncol(y), length(key(x)))
  yjoin <- unique(y[, ..ycols])
  yjoin
}
x[eval(sJ(x,y))]
For Victor's simpler example, this gives the desired output:
x y
1: 1 a
2: 3 c
3: 5 e
This is ~30% slower than Victor's way.
EDIT: And Victor's approach, taking unique before joining, is quite a bit faster:
N <- 1e5*26
x <- data.table(x = 1:N, y = letters, z = rnorm(N))
setkey(x, x)
y <- data.table(x = sample(N, N/10, replace = TRUE), z = sample(letters, N/10, replace = TRUE))
setkey(y, x)
require(microbenchmark)
microbenchmark(
sJ=x[eval(sJ(x,y))],
dolla=unique(x[eval(y$x)]),
brack=x[eval(unique(y[['x']]))]
)
# Unit: milliseconds
#   expr       min        lq    median        uq      max neval
#     sJ 120.22700 125.04900 126.50704 132.35326 217.6566   100
#  dolla 105.05373 108.33804 109.16249 118.17613 285.9814   100
#  brack  53.95656  61.32669  61.88227  65.21571 235.8048   100
I'm guessing the [[ vs $ doesn't help the speed, but I didn't check.
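One could isolate just the column extraction with a minimal sketch like this (illustrative only, not part of the original benchmark; it reuses the y and microbenchmark already loaded above):
# timing only the extraction of the join column, not the join itself
microbenchmark(
  dolla = y$x,
  brack = y[['x']]
)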
Upvotes: 3
Reputation: 49448
I'm confused by all the not-joins above - isn't what you want simply:
unique(x[y, .SD])
# x y
#1: 1 a
If x can have duplicate keys, then you can unique y instead:
## Creating an example data.table with a three-times-repeated first row
x <- data.table(x = c(1,1,1,2), y = c("a", "a", "a", "b"))
setkey(x, x)
y <- data.table(x = c(1, 1), z = 10:11)
setkey(y, x)
x[eval(unique(y, by = key(y))), .SD] # data.table >= 1.9.8 requires by=key(y)
# x y
# 1: 1 a
# 2: 1 a
# 3: 1 a
Upvotes: 10
Reputation: 31
Try the following:
w <- y[,unique(x)]
x[x %in% w]
Output will be:
x y
1: 1 a
Upvotes: 0
Reputation: 568
This thread is old, but I noticed that the solution can be derived easily from the definition of a semi-join given in the original post:
"A semi-join is like an inner join except that it only returns the columns of X (not also those of Y), and does not repeat the rows of X to match the rows of Y"
library(data.table)
dt1 <- data.table(ProdId = 1:4,
Product = c("Bread", "Cheese", "Pizza", "Butter"))
dt2 <- data.table(ProdId = c(1, 1, 3, 4, 5),
Company = c("A", "B", "C", "D", "E"))
# semi-join
unique(merge(dt1, dt2, by="ProdId")[, names(dt1), with=FALSE])
ProdId Product
1: 1 Bread
2: 3 Pizza
3: 4 Butter
I've simply applied the syntax of an inner join, then kept only the columns of the first table, using unique() to remove the rows of the first table that were repeated to match rows of the second table.
Edit: The above approach will match dplyr::semi_join() output only if the first table has unique rows. If we need to output all the rows from the first table, including duplicates, then we can use the fsetdiff() approach shown below.
Another one-line data.table solution:
fsetdiff(dt1, dt1[!dt2, on="ProdId"])
ProdId Product
1: 1 Bread
2: 3 Pizza
3: 4 Butter
I've simply removed from the first table the anti-join of the first and second tables, which seems simpler to me. If the first table has duplicate rows, we need:
fsetdiff(dt1, dt1[!dt2, on="ProdId"], all=T)
The fsetdiff() result with all=T matches the output from dplyr:
dplyr::semi_join(dt1, dt2, by="ProdId")
ProdId Product
1 1 Bread
2 3 Pizza
3 4 Butter
Using another set of data taken from one of the previous posts:
x <- data.table(x = c(1,1,1,2), y = c("a", "a", "a", "b"))
y <- data.table(x = c(1, 1), z = 10:11)
With dplyr:
dplyr::semi_join(x, y, by="x")
x y
1 1 a
2 1 a
3 1 a
With data.table:
fsetdiff(x, x[!y, on="x"], all=T)
x y
1: 1 a
2: 1 a
3: 1 a
Without all=T, the duplicate rows are removed:
fsetdiff(x, x[!y, on="x"])
x y
1: 1 a
Upvotes: 2
Reputation: 35
The package dplyr supports the following four join types: inner_join, left_join, semi_join, and anti_join.
So for the semi-join, try the following code:
library("dplyr")
table1 <- data.table(x = 1:2, y = c("a", "b"))
table2 <- data.table(x = c(1, 1), z = 10:11)
semi_join(table1, table2)
The output is as expected:
# Joining by: "x"
# Source: local data table [1 x 2]
#
# x y
# (int) (chr)
# 1 1 a
Upvotes: 0
Reputation: 59602
More possibilities :
w = unique(x[y,which=TRUE]) # the row numbers in x which have a match from y
x[w]
If there are duplicate key values in x, then that needs:
w = unique(x[y,which=TRUE,allow.cartesian=TRUE])
x[w]
Or, the other way around :
setkey(y,x)
w = !is.na(y[x,which=TRUE,mult="first"])
x[w]
If nrow(x) << nrow(y) then the y[x] approach should be faster.
If nrow(x) >> nrow(y) then the x[y] approach should be faster.
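A rough way to compare both directions on larger data (a sketch, not from the original answer; the table sizes and column contents are arbitrary):
library(data.table)
library(microbenchmark)
N <- 1e6
x <- data.table(x = 1:N, y = sample(letters, N, TRUE), key = "x")
y <- data.table(x = sample(N, N/10, TRUE), z = rnorm(N/10), key = "x")
microbenchmark(
  xy = x[unique(x[y, which = TRUE, allow.cartesian = TRUE])],   # x[y] direction
  yx = x[!is.na(y[x, which = TRUE, mult = "first"])],           # y[x] direction
  times = 10
)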
But the anti anti join appeals too :-)
Upvotes: 23
Reputation: 4094
One solution I can think of is:
tmp <- x[!y]
x[!tmp]
In data.table, you can use another data.table as the i expression (i.e., the first argument inside the [ call on a data.table), and that will perform a join, e.g.:
x <- data.table(x = 1:10, y = letters[1:10])
setkey(x, x)
y <- data.table(x = c(1,3,5,1), z = 1:4)
> x[y]
x y z
1: 1 a 1
2: 3 c 2
3: 5 e 3
4: 1 a 4
The ! before the i expression is an extension of the syntax above that performs a 'not-join', as described on p. 11 of the data.table documentation. So the first assignment evaluates to the subset of x that doesn't have any rows where the key (column x) is present in y:
> x[!y]
x y
1: 2 b
2: 4 d
3: 6 f
4: 7 g
5: 8 h
6: 9 i
7: 10 j
It is similar to setdiff in this regard. Therefore the second statement returns all the rows in x where the key is present in y.
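For completeness, running both statements on the example above gives the semi-join result (the same output shown earlier in the thread):
> tmp <- x[!y]
> x[!tmp]
   x y
1: 1 a
2: 3 c
3: 5 e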
The ! feature was added in data.table 1.8.4 with the following note in NEWS:
o A new "!" prefix on i signals 'not-join' (a.k.a. 'not-where'), #1384i.
    DT[-DT["a", which=TRUE, nomatch=0]]   # old not-join idiom, still works
    DT[!"a"]                              # same result, now preferred.
    DT[!J(6),...]                         # !J == not-join
    DT[!2:3,...]                          # ! on all types of i
    DT[colA!=6L | colB!=23L,...]          # multiple vector scanning approach (slow)
    DT[!J(6L,23L)]                        # same result, faster binary search
  '!' has been used rather than '-' :
    * to match the 'not-join'/'not-where' nomenclature
    * with '-', DT[-0] would return DT rather than DT[0] and not be backwards compatible.
      With '!', DT[!0] returns DT both before (since !0 is TRUE in base R) and after this new feature.
    * to leave DT[+J...] and DT[-J...] available for future use
For some reason, the following doesn't work: x[!(x[!y])] - probably data.table is too smart about parsing the argument.
P.S. As Josh O'Brien pointed out in another answer, a one-liner would be x[!eval(x[!y])].
Upvotes: 14
Reputation: 4094
Update. Based on all the discussion here, I would do something like this, which should be fast and work in the most general case:
x[eval(unique(y[, key(x), with = FALSE]))]
Here is another, more direct solution:
unique(x[eval(y$x)])
It runs faster - here is a comparison of run times with my previous solution:
# Generate some large data
N <- 1000000 * 26
x <- data.table(x = 1:N, y = letters, z = rnorm(N))
setkey(x, x)
y <- data.table(x = sample(N, N/10, replace = TRUE), z = sample(letters, N/10, replace = TRUE))
setkey(y, x)
system.time(r1 <- x[!eval(x[!y])])
user system elapsed
7.772 1.217 11.998
system.time(r2 <- unique(x[eval(y$x)]))
user system elapsed
0.540 0.142 0.723
In a more general case, you can do something like
x[eval(y[, key(x), with = FALSE])]
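For example, applied to the x and y from the original question (both keyed on column x) and wrapped in unique() to drop the repeated rows, a quick illustration (not in the original answer) looks like:
x <- data.table(x = 1:2, y = c("a", "b"), key = "x")
y <- data.table(x = c(1, 1), z = 10:11, key = "x")
unique(x[eval(y[, key(x), with = FALSE])])
#    x y
# 1: 1 a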
Upvotes: 3