user3645882

Reputation: 739

R data.table (<= 1.9.4) join behaviour

I am back to using R and data.table after some time, and I still have trouble with joins. I previously asked this question, which resulted in a satisfactory explanation, but I still do not really get the logic. Let's consider a few examples:

library("data.table")
X <- data.table(chiave=c("a", "a", "a", "b", "b"),valore1=1:5)
Y <- data.table(chiave=c("a", "b", "c", "d"),valore2=1:4)
X
   chiave valore1
1:      a       1
2:      a       2
3:      a       3
4:      b       4
5:      b       5
 Y
   chiave valore2
1:      a       1
2:      b       2
3:      c       3
4:      d       4

When I join, I get this error:

 setkey(X,chiave)
 X[Y]
# Error in vecseq(f__, len__, if (allow.cartesian || notjoin) NULL else as.integer(max(nrow(x),  : 
  Join results in 7 rows; more than 5 = max(nrow(x),nrow(i)). Check for duplicate key values in i, each of which join to the same group in x over and over again. If that's ok, try including `j` and dropping `by` (by-without-by) so that j runs for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-help for advice.

so:

 X[Y,allow.cartesian=TRUE]
   chiave valore1 valore2
1:      a       1       1
2:      a       2       1
3:      a       3       1
4:      b       4       2
5:      b       5       2
6:      c      NA       3
7:      d      NA       4

Please note that X has duplicate keys and i (here Y) doesn't. If I change Y to:

 Y <- data.table(chiave=c("b", "c", "d"),valore2=1:3)
 Y
   chiave valore2
1:      b       1
2:      c       2
3:      d       3

The join runs with no error message and without allow.cartesian, but logically the situation is the same: X has duplicate keys and i doesn't.

 X[Y]
   chiave valore1 valore2
1:      b       4       1
2:      b       5       1
3:      c      NA       2
4:      d      NA       3
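The difference between these two cases follows from the row-count check quoted in the error message (result rows compared against max(nrow(x),nrow(i))). A minimal sketch of that arithmetic in base R, assuming each row of i contributes one output row per match in x, or a single NA row when there is no match; `join_rows` is a hypothetical helper written for illustration, not a data.table function:

```r
# Hypothetical helper: number of rows X[Y] would produce for a single key column,
# assuming unmatched rows of i yield one NA row each
join_rows <- function(x_keys, i_keys) {
  matches <- vapply(i_keys, function(k) sum(x_keys == k), integer(1))
  sum(pmax(1L, matches))
}

x_keys <- c("a", "a", "a", "b", "b")

# First example: i = a, b, c, d
rows1  <- join_rows(x_keys, c("a", "b", "c", "d"))   # 3 + 2 + 1 + 1 = 7
limit1 <- max(length(x_keys), 4L)                    # max(5, 4) = 5
rows1 > limit1                                       # TRUE: the check trips

# Second example: i = b, c, d
rows2  <- join_rows(x_keys, c("b", "c", "d"))        # 2 + 1 + 1 = 4
limit2 <- max(length(x_keys), 3L)                    # max(5, 3) = 5
rows2 > limit2                                       # FALSE: no error
```

So the check depends only on the size of the result relative to the inputs, not on where the duplicates are.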

On the other hand:

 X <- data.table(chiave=c("a", "a", "a", "a", "a", "a", "b", "b"),valore1=1:8)
 Y <- data.table(chiave=c("b", "b", "d"),valore2=1:3)
 X
   chiave valore1
1:      a       1
2:      a       2
3:      a       3
4:      a       4
5:      a       5
6:      a       6
7:      b       7
8:      b       8
 Y
   chiave valore2
1:      b       1
2:      b       2
3:      d       3

I have duplicate keys in both X and i, yet the join (a cartesian product) is done with no error message and no need for allow.cartesian:

 setkey(X,chiave)
 X[Y]
   chiave valore1 valore2
1:      b       7       1
2:      b       8       1
3:      b       7       2
4:      b       8       2
5:      d      NA       3

From my point of view, I need to be warned if and only if I have duplicate keys in both X and i (not merely when the resulting table has more rows than max(nrow(x),nrow(i))), and only in that case do I see the need for allow.cartesian (so not in my first two examples).
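The condition described above could be checked by hand before joining. This is only a sketch: `dup_in_both` is a hypothetical helper (not part of data.table) that uses `anyDuplicated` on the key column of each table.

```r
library(data.table)

# Hypothetical helper: TRUE only when the key column has duplicates in BOTH tables
dup_in_both <- function(x, i, key) {
  anyDuplicated(x[[key]]) > 0L && anyDuplicated(i[[key]]) > 0L
}

# First example: duplicates in X only
X1 <- data.table(chiave = c("a", "a", "a", "b", "b"), valore1 = 1:5)
Y1 <- data.table(chiave = c("a", "b", "c", "d"), valore2 = 1:4)
check1 <- dup_in_both(X1, Y1, "chiave")   # FALSE

# Third example: duplicates in both X and Y
X3 <- data.table(chiave = c(rep("a", 6), "b", "b"), valore1 = 1:8)
Y3 <- data.table(chiave = c("b", "b", "d"), valore2 = 1:3)
check3 <- dup_in_both(X3, Y3, "chiave")   # TRUE
```

Note that this flags duplicates anywhere in each key column; a stricter variant would require the same key value to be duplicated on both sides.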

Upvotes: 3

Views: 128

Answers (1)

Arun

Reputation: 118889

Just to keep this answered: this behaviour with allow.cartesian has been fixed in the current development version, v1.9.5, and will soon be available on CRAN as v1.9.6. Odd-numbered versions are development releases and even-numbered ones are stable. From NEWS:

  1. allow.cartesian is ignored during joins when:

    • i has no duplicates and mult="all". Closes #742. Thanks to @nigmastar for the report.
    • assigning by reference, i.e., j has :=. Closes #800. Thanks to @matthieugomez for the report.

    In both these cases (and during a not-join which was already fixed in 1.9.4), allow.cartesian can be safely ignored.
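As a rough illustration of the first bullet, the first example from the question should then run without allow.cartesian. This sketch assumes data.table v1.9.6 or later is installed; on v1.9.4 or earlier the same `X[Y]` call raises the error shown in the question.

```r
library(data.table)

X <- data.table(chiave = c("a", "a", "a", "b", "b"), valore1 = 1:5)
Y <- data.table(chiave = c("a", "b", "c", "d"), valore2 = 1:4)
setkey(X, chiave)

# i (Y) has no duplicate keys, so allow.cartesian is ignored (assumes >= 1.9.6)
res <- X[Y]
nrow(res)   # 7 rows, more than max(nrow(X), nrow(Y)) = 5
```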

Upvotes: 2
