I am back to using R and data.table after some time, and I still have issues with joins. I previously asked a question about this and got a satisfactory explanation, but I still do not really get the logic. Let's consider a few examples:
library("data.table")
X <- data.table(chiave=c("a", "a", "a", "b", "b"),valore1=1:5)
Y <- data.table(chiave=c("a", "b", "c", "d"),valore2=1:4)
X
chiave valore1
1: a 1
2: a 2
3: a 3
4: b 4
5: b 5
Y
chiave valore2
1: a 1
2: b 2
3: c 3
4: d 4
When I join, I get an error:
setkey(X,chiave)
X[Y]
# Error in vecseq(f__, len__, if (allow.cartesian || notjoin) NULL else as.integer(max(nrow(x), :
Join results in 7 rows; more than 5 = max(nrow(x),nrow(i)). Check for duplicate key values in i, each of which join to the same group in x over and over again. If that's ok, try including `j` and dropping `by` (by-without-by) so that j runs for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-help for advice.
so:
X[Y,allow.cartesian=TRUE]
chiave valore1 valore2
1: a 1 1
2: a 2 1
3: a 3 1
4: b 4 2
5: b 5 2
6: c NA 3
7: d NA 4
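The row count quoted in the error message can be reproduced by hand. The following is only a sketch in base R, not how data.table computes it internally; `x_keys`, `i_keys`, `expected_rows` and `limit` are names made up for this illustration. Each row of i contributes one output row per matching row in X, or a single NA row when nothing matches.

```r
# Key columns from the first example (plain vectors, no package needed)
x_keys <- c("a", "a", "a", "b", "b")
i_keys <- c("a", "b", "c", "d")

# Number of rows in X matching each row of i
matches <- vapply(i_keys, function(k) sum(x_keys == k), integer(1))

# Unmatched rows of i still yield one output row (filled with NA)
expected_rows <- sum(pmax(matches, 1L))

# The limit the error message compares against
limit <- max(length(x_keys), length(i_keys))

expected_rows  # 7
limit          # 5
```

Since 7 exceeds 5, the check fires even though i has no duplicate keys.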
Please note that X has duplicate keys and i doesn't. If I change Y to:
Y <- data.table(chiave=c("b", "c", "d"),valore2=1:3)
Y
chiave valore2
1: b 1
2: c 2
3: d 3
The join is done with no error message and no need for allow.cartesian, but logically the situation is the same: X has duplicate keys and i doesn't.
X[Y]
chiave valore1 valore2
1: b 4 1
2: b 5 1
3: c NA 2
4: d NA 3
On the other hand:
X <- data.table(chiave=c("a", "a", "a", "a", "a", "a", "b", "b"),valore1=1:8)
Y <- data.table(chiave=c("b", "b", "d"),valore2=1:3)
X
chiave valore1
1: a 1
2: a 2
3: a 3
4: a 4
5: a 5
6: a 6
7: b 7
8: b 8
Y
chiave valore2
1: b 1
2: b 2
3: d 3
I have duplicate keys in both X and i, but the join (a cartesian product) is done with no error message and no need for allow.cartesian:
setkey(X,chiave)
X[Y]
chiave valore1 valore2
1: b 7 1
2: b 8 1
3: b 7 2
4: b 8 2
5: d NA 3
From my point of view, I need to be warned if and only if I have duplicate keys in both X and i (not just when the resulting table has more rows than max(nrow(x),nrow(i))), and only in that case do I see the need for allow.cartesian (so not in my first two examples).
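The condition described here (duplicates on both sides of the join) could be tested before joining with something like the sketch below. `dup_on_both_sides` is a hypothetical helper written for this post, not a data.table function, and the tables are renamed `X1`, `Y1`, `X3`, `Y3` only to keep the first and third examples apart.

```r
library(data.table)

# Hypothetical helper (not part of data.table): TRUE only when the join
# column contains repeated values in BOTH tables.
dup_on_both_sides <- function(x, i, on) {
  anyDuplicated(x[[on]]) > 0L && anyDuplicated(i[[on]]) > 0L
}

# First example: duplicates in X only
X1 <- data.table(chiave = c("a", "a", "a", "b", "b"), valore1 = 1:5)
Y1 <- data.table(chiave = c("a", "b", "c", "d"), valore2 = 1:4)
dup_on_both_sides(X1, Y1, "chiave")  # FALSE

# Third example: duplicates in both X and i
X3 <- data.table(chiave = c("a", "a", "a", "a", "a", "a", "b", "b"), valore1 = 1:8)
Y3 <- data.table(chiave = c("b", "b", "d"), valore2 = 1:3)
dup_on_both_sides(X3, Y3, "chiave")  # TRUE
```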
Upvotes: 3
Views: 128
Reputation: 118889
Just to keep this answered, this behaviour with allow.cartesian has been fixed in the current development version v1.9.5, and will soon be available on CRAN as v1.9.6. Odd versions are devel, and even stable. From NEWS:

allow.cartesian is ignored during joins when:
- i has no duplicates and mult="all". Closes #742. Thanks to @nigmastar for the report.
- assigning by reference, i.e., j has :=. Closes #800. Thanks to @matthieugomez for the report.

In both these cases (and during a not-join which was already fixed in 1.9.4), allow.cartesian can be safely ignored.
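As a quick check, here is a sketch that assumes data.table 1.9.6 or later is installed. The first part reruns the question's first example. The second part uses made-up tables `X2` and `Y2` with duplicates on both sides, where I would still expect a current release to ask for allow.cartesian=TRUE (that expectation is my assumption, not something stated in the NEWS item above).

```r
library(data.table)

X <- data.table(chiave = c("a", "a", "a", "b", "b"), valore1 = 1:5)
Y <- data.table(chiave = c("a", "b", "c", "d"), valore2 = 1:4)
setkey(X, chiave)

# i (Y) has no duplicate keys, so allow.cartesian is not needed
res <- X[Y]
nrow(res)  # 7, as in the question's first output

# Duplicates on both sides: 3 x 3 = 9 result rows from 3 + 3 input rows
X2 <- data.table(chiave = rep("b", 3), valore1 = 1:3, key = "chiave")
Y2 <- data.table(chiave = rep("b", 3), valore2 = 1:3)

# Capture the error instead of stopping the script
err <- tryCatch(X2[Y2], error = function(e) e)
inherits(err, "error")

# Opting in explicitly performs the cartesian join
big <- X2[Y2, allow.cartesian = TRUE]
nrow(big)  # 9
```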
Upvotes: 2