red_quark
red_quark

Reputation: 1001

R: cross join column on itself excluding symmetric results

I have a data table init that looks like this:

> init
+---+
| id|       
+---+
|  a|
+---+
|  b|
+---+
|  c|
+---+

I want to obtain all pairs for id column, so I need to cross join the init data table on itself. Additionally, I want to exclude equal and symmetric results (in my case a,b == b,a, etc.).

Desired output is:

+---+---+
|id1|id2|
+---+---+
|  a|  b|
|  a|  c|
|  b|  c|
+---+---+

How can this be done with the data.table approach?

Full cross join can be implemented as full_cj <- CJ(init$id, init$id):

+---+---+
| V1| V2|
+---+---+
|  a|  a|
|  a|  b|
|  a|  c|
|  b|  a|
|  b|  b|
|  b|  c|
|  c|  a|
|  c|  b|
|  c|  c|
+---+---+

But how can I remove symmetrical and identical results from the output?
My real data is huge, so I'm looking for an efficient solution.

Upvotes: 3

Views: 137

Answers (2)

Uwe
Uwe

Reputation: 42592

For the sake of completeness, this can be achieved as well using base R by

init[, .(id1 = rep(id, rev(seq_along(id) - 1L)),
         id2 = unlist(lapply(-seq_along(id), tail, x = id)))]

Benchmark

As the OP has mentioned

My real data is huge, so I'm looking for an efficient solution.

here is a benchmark comparing

  • cj: CJ() with subsequent filtering
  • sj: langtang's non-equi self-join
  • rp: above base R approach using rep() and tail()

for problem sizes (length of input vector) of 10, 100, 1000, and 10k rows. Note that for the 10k case the result contains nearly 50 million combinations. So available memory is a limiting factor.

Note that bench::mark() verifies that all approaches return the same result.

library(bench)
library(data.table)
bm <- press(
  n_id = 10^(4:1),
  {
    init0 <- data.table(id = sprintf("%09i", seq(n_id)))
    cat(sprintf("Number of combinations in result: %g\n", (n_id-1)*n_id/2))
    mark(
      rp = {
        init <- copy(init0)
        init[, .(id1 = rep(id, rev(seq_along(id) - 1L)),
                 id2 = unlist(lapply(-seq_along(id), tail, x = id)))]
      },
      sj = {
        init <- copy(init0)
        init[, id2 := as.factor(id)][
          init, on = .(id2 > id2), nomatch = 0, .(id1 = as.character(id2), id2 = id)]
      },
      cj = {
        init <- copy(init0)
        init[, CJ(id1 = id, id2 = id)][id1 < id2]
      },
      check = function(x, y) {
        tmp <- all.equal(x, y, check.attributes = FALSE)
          if (!isTRUE(tmp)) {
            cat(tmp, "\n")
            str(x)
            str(y)
          }
        tmp
      },
      min_iterations = 1L
    )
  }
)

bm
   expression  n_id      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result   memory  time 
   <bch:expr> <dbl> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>   <list>  <lis>
 1 rp         10000     1.8s     1.8s    0.556     2.05GB   0.556      1     1       1.8s <data.t… <Rprof… <ben…
 2 sj         10000    2.87s    2.87s    0.348     2.24GB   0.696      1     2      2.87s <data.t… <Rprof… <ben…
 3 cj         10000   49.29s   49.29s    0.0203    3.17GB   0.0203     1     1     49.29s <data.t… <Rprof… <ben…
 4 rp          1000  22.11ms  22.53ms   44.2      21.15MB   2.10      21     1   475.58ms <data.t… <Rprof… <ben…
 5 sj          1000  18.28ms  18.88ms   52.4      23.19MB   2.02      26     1   496.27ms <data.t… <Rprof… <ben…
 6 cj          1000 542.97ms 542.97ms    1.84     32.53MB   0          1     0   542.97ms <data.t… <Rprof… <ben…
 7 rp           100   1.46ms   1.52ms  651.      285.94KB   6.56     298     3   457.53ms <data.t… <Rprof… <ben…
 8 sj           100   2.14ms   2.21ms  450.      406.48KB   4.24     212     2   471.37ms <data.t… <Rprof… <ben…
 9 cj           100   6.92ms   6.97ms  143.      431.23KB   0         72     0   504.44ms <data.t… <Rprof… <ben…
10 rp            10  679.8µs 738.38µs 1343.       66.34KB   6.60     610     3   454.25ms <data.t… <Rprof… <ben…
11 sj            10   1.71ms   1.81ms  552.      156.17KB   4.22     262     2   474.47ms <data.t… <Rprof… <ben…
12 cj            10 915.48µs 967.99µs 1029.      100.83KB   4.21     489     2   475.13ms <data.t… <Rprof… <ben…
library(ggplot2)
autoplot(bm)

enter image description here

Conclusion

cj is magnitudes slower than the other 2 approaches for larger use cases while rp is on a par with sj but allocates slightly less memory.

Upvotes: 1

langtang
langtang

Reputation: 24845

You can use a non equi join after converting to factor

dt[, id2:=as.factor(id)][dt, on=.(id2>id2),nomatch=0,.(id, id2)]

Upvotes: 2

Related Questions