Reputation: 1001
I have a data table init
that looks like this:
> init
+---+
| id|
+---+
| a|
+---+
| b|
+---+
| c|
+---+
I want to obtain all pairs for id
column, so I need to cross join the init
data table on itself. Additionally, I want to exclude equal and symmetric results (in my case a,b == b,a
, etc.).
Desired output is:
+---+---+
|id1|id2|
+---+---+
| a| b|
| a| c|
| b| c|
+---+---+
How can this be done with the data.table
approach?
Full cross join can be implemented as
full_cj <- CJ(init$id, init$id)
:
+---+---+
| V1| V2|
+---+---+
| a| a|
| a| b|
| a| c|
| b| a|
| b| b|
| b| c|
| c| a|
| c| b|
| c| c|
+---+---+
But how can I remove symmetrical and identical results from the output?
My real data is huge, so I'm looking for an efficient solution.
Upvotes: 3
Views: 137
Reputation: 42592
For the sake of completeness, this can be achieved as well using base R by
init[, .(id1 = rep(id, rev(seq_along(id) - 1L)),
id2 = unlist(lapply(-seq_along(id), tail, x = id)))]
As the OP has mentioned
My real data is huge, so I'm looking for an efficient solution.
here is a benchmark comparing
rep()
and tail()
for problem sizes (length of input vector) of 10, 100, 1000, and 10k rows. Note that for the 10k case the result contains nearly 50 million combinations. So available memory is a limiting factor.
Note that bench::mark()
verifies that all approaches return the same result.
library(bench)
library(data.table)
bm <- press(
n_id = 10^(4:1),
{
init0 <- data.table(id = sprintf("%09i", seq(n_id)))
cat(sprintf("Number of combinations in result: %g\n", (n_id-1)*n_id/2))
mark(
rp = {
init <- copy(init0)
init[, .(id1 = rep(id, rev(seq_along(id) - 1L)),
id2 = unlist(lapply(-seq_along(id), tail, x = id)))]
},
sj = {
init <- copy(init0)
init[, id2 := as.factor(id)][
init, on = .(id2 > id2), nomatch = 0, .(id1 = as.character(id2), id2 = id)]
},
cj = {
init <- copy(init0)
init[, CJ(id1 = id, id2 = id)][id1 < id2]
},
check = function(x, y) {
tmp <- all.equal(x, y, check.attributes = FALSE)
if (!isTRUE(tmp)) {
cat(tmp, "\n")
str(x)
str(y)
}
tmp
},
min_iterations = 1L
)
}
)
bm
expression n_id min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time <bch:expr> <dbl> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <lis> 1 rp 10000 1.8s 1.8s 0.556 2.05GB 0.556 1 1 1.8s <data.t… <Rprof… <ben… 2 sj 10000 2.87s 2.87s 0.348 2.24GB 0.696 1 2 2.87s <data.t… <Rprof… <ben… 3 cj 10000 49.29s 49.29s 0.0203 3.17GB 0.0203 1 1 49.29s <data.t… <Rprof… <ben… 4 rp 1000 22.11ms 22.53ms 44.2 21.15MB 2.10 21 1 475.58ms <data.t… <Rprof… <ben… 5 sj 1000 18.28ms 18.88ms 52.4 23.19MB 2.02 26 1 496.27ms <data.t… <Rprof… <ben… 6 cj 1000 542.97ms 542.97ms 1.84 32.53MB 0 1 0 542.97ms <data.t… <Rprof… <ben… 7 rp 100 1.46ms 1.52ms 651. 285.94KB 6.56 298 3 457.53ms <data.t… <Rprof… <ben… 8 sj 100 2.14ms 2.21ms 450. 406.48KB 4.24 212 2 471.37ms <data.t… <Rprof… <ben… 9 cj 100 6.92ms 6.97ms 143. 431.23KB 0 72 0 504.44ms <data.t… <Rprof… <ben… 10 rp 10 679.8µs 738.38µs 1343. 66.34KB 6.60 610 3 454.25ms <data.t… <Rprof… <ben… 11 sj 10 1.71ms 1.81ms 552. 156.17KB 4.22 262 2 474.47ms <data.t… <Rprof… <ben… 12 cj 10 915.48µs 967.99µs 1029. 100.83KB 4.21 489 2 475.13ms <data.t… <Rprof… <ben…
library(ggplot2)
autoplot(bm)
cj
is magnitudes slower than the other 2 approaches for larger use cases while rp
is on a par with sj
but allocates slightly less memory.
Upvotes: 1
Reputation: 24845
You can use a non equi join after converting to factor
dt[, id2:=as.factor(id)][dt, on=.(id2>id2),nomatch=0,.(id, id2)]
Upvotes: 2