Reputation: 109874
Sometimes I want to use a double for loop over the columns of a matrix, compute some value for each pair of columns, and assign it to a cell in a result matrix. A correlation table is an example of this. I was wondering if, and how, this can be done in data.table syntax. Here's the example as a for loop. How can I do the same thing in data.table? I mainly want to know whether it can be done, even if it is slower, though faster would be nice. Note that we can't assume the computed value will give a symmetric matrix (i.e., y[i, j] != y[j, i] in general).
cos_sim <- function(x, y) x %*% y / sqrt(x%*%x * y%*%y)
x <- mtcars
y <- matrix(NA_real_, nrow = ncol(x), ncol = ncol(x))
for (i in seq_len(ncol(x))) {
  for (j in seq_len(ncol(x))) {
    y[i, j] <- cos_sim(x[, i], x[, j])
  }
}
library(data.table)
x <- as.data.frame(x)
setDT(x)
Upvotes: 1
Views: 117
Reputation: 66819
Matrix algebra
As far as efficiency goes, matrix operations are your best bet:
mx <- as.matrix(x)
sx <- 1 / sqrt( colSums(mx^2) )
res <- (t(mx) %*% mx) * (sx %*% t(sx))
This also gives you nice row and column labels, unlike the OP's for loop.
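As a quick sanity check, the matrix form can be compared against the question's double loop. This is only a sketch: `cos_sim_scalar` and `loop_res` are names introduced here for illustration, and `crossprod`/`tcrossprod` are used as base R shorthands for `t(mx) %*% mx` and `sx %*% t(sx)`.

```r
# Scalar variant of the question's cos_sim (illustrative helper, not from the original posts)
cos_sim_scalar <- function(a, b) sum(a * b) / sqrt(sum(a * a) * sum(b * b))

mx <- as.matrix(mtcars)
sx <- 1 / sqrt(colSums(mx^2))   # inverse Euclidean norm of each column

# crossprod(mx) is t(mx) %*% mx; tcrossprod(sx) is sx %*% t(sx)
res <- crossprod(mx) * tcrossprod(sx)

# Reference result from the explicit double loop
p <- ncol(mx)
loop_res <- matrix(NA_real_, nrow = p, ncol = p)
for (i in seq_len(p)) {
  for (j in seq_len(p)) {
    loop_res[i, j] <- cos_sim_scalar(mx[, i], mx[, j])
  }
}

# Compare values only (the loop result carries no dimnames)
agree <- isTRUE(all.equal(res, loop_res, check.attributes = FALSE))
```

Note that this shortcut relies on cosine similarity factoring into a matrix product and a scaling step. It would not carry over to an arbitrary pairwise function, in particular not to an asymmetric one.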
data.table
This isn't really natural here, but...
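# add a row id, melt to long format, then drop the id from x again
meltx <- melt(x[, id := .I], id.vars = "id"); x[, id := NULL]
# the self-join on id pairs every column with every column, row by row
cartx <- meltx[meltx, on = "id", allow.cartesian = TRUE]
res2 <- dcast(cartx[, cos_sim(value, i.value), by = .(v1 = variable, v2 = i.variable)], v1 ~ v2, value.var = "V1")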
You get a data.table back out in this case, if that's a plus.
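An alternative that avoids melting the data is to enumerate every ordered pair of column names with `CJ` and evaluate the function once per pair. The following is a minimal sketch, not a tuned solution: `cos_sim_scalar`, `pairs`, and `res3` are names made up for this example, and it assumes the pairwise function returns a single number.

```r
library(data.table)

# Scalar variant of the question's cos_sim (illustrative helper)
cos_sim_scalar <- function(a, b) sum(a * b) / sqrt(sum(a * a) * sum(b * b))

x <- as.data.table(mtcars)

# Every ordered pair (v1, v2) of column names, so f(v1, v2) and f(v2, v1)
# are computed separately and no symmetry is assumed
pairs <- CJ(v1 = names(x), v2 = names(x), sorted = FALSE)

# Look up both columns by name and apply the pairwise function
pairs[, value := mapply(function(a, b) cos_sim_scalar(x[[a]], x[[b]]),
                        v1, v2, USE.NAMES = FALSE)]

# Wide layout: one row per v1, one column per v2
res3 <- dcast(pairs, v1 ~ v2, value.var = "value")
```

Because the columns are character, `dcast` orders rows and columns alphabetically rather than in the original column order. Converting `v1` and `v2` to factors with the desired levels beforehand is one way to control that.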
Upvotes: 1
Reputation: 887148
Another base R approach would be outer.
outer(x, x, FUN=Vectorize(cos_sim))
#           mpg       cyl      disp        hp      drat        wt      qsec
#mpg  1.0000000 0.8566168 0.7356738 0.7794276 0.9768897 0.8483280 0.9660715
#cyl  0.8566168 1.0000000 0.9656088 0.9689702 0.9241079 0.9828563 0.9414552
#disp 0.7356738 0.9656088 1.0000000 0.9576400 0.8266655 0.9659344 0.8599014
#hp   0.7794276 0.9689702 0.9576400 1.0000000 0.8717482 0.9492708 0.8750691
#drat 0.9768897 0.9241079 0.8266655 0.8717482 1.0000000 0.9183274 0.9859895
#wt   0.8483280 0.9828563 0.9659344 0.9492708 0.9183274 1.0000000 0.9484697
#qsec 0.9660715 0.9414552 0.8599014 0.8750691 0.9859895 0.9484697 1.0000000
#vs   0.7753943 0.4700802 0.3356976 0.3742408 0.7022767 0.5143092 0.7130090
#am   0.7421732 0.5030698 0.3505303 0.5007184 0.7101727 0.4575882 0.6169362
#gear 0.9672733 0.9177938 0.8172070 0.8812034 0.9903890 0.9076279 0.9723964
#carb 0.7581483 0.9082799 0.8604485 0.9450793 0.8549106 0.8943285 0.8346877
#            vs        am      gear      carb
#mpg  0.7753943 0.7421732 0.9672733 0.7581483
#cyl  0.4700802 0.5030698 0.9177938 0.9082799
#disp 0.3356976 0.3505303 0.8172070 0.8604485
#hp   0.3742408 0.5007184 0.8812034 0.9450793
#drat 0.7022767 0.7101727 0.9903890 0.8549106
#wt   0.5143092 0.4575882 0.9076279 0.8943285
#qsec 0.7130090 0.6169362 0.9723964 0.8346877
#vs   1.0000000 0.5188745 0.6788292 0.3655971
#am   0.5188745 1.0000000 0.7435907 0.5766850
#gear 0.6788292 0.7435907 1.0000000 0.8802046
#carb 0.3655971 0.5766850 0.8802046 1.0000000
It can also be written in data.table syntax, but the output is a matrix, so I wouldn't say that there would be any improvement in efficiency.
setDT(x)[,outer(.SD, .SD, FUN=Vectorize(cos_sim))]
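`Vectorize` is needed because `outer` calls `FUN` only once, with both inputs expanded to all combinations, and expects a result of the same length. A slightly more explicit variant, sketched here under the assumption that the pairwise function returns a single number, runs `outer` over column indices instead of the data frame itself. `cos_sim_scalar`, `pair_fun`, and `res4` are illustrative names, not part of the original answer.

```r
# Scalar variant of the question's cos_sim (illustrative helper)
cos_sim_scalar <- function(a, b) sum(a * b) / sqrt(sum(a * a) * sum(b * b))

x <- mtcars
idx <- seq_along(x)   # column positions 1..ncol(x)

# Vectorize wraps the function in mapply, so it accepts whole index vectors
pair_fun <- Vectorize(function(i, j) cos_sim_scalar(x[[i]], x[[j]]))

# res4[i, j] holds the value for (column i, column j)
res4 <- outer(idx, idx, FUN = pair_fun)
dimnames(res4) <- list(names(x), names(x))
```

Since every ordered pair (i, j) is evaluated on its own, this form also works when the function is not symmetric.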
Upvotes: 2
Reputation: 9582
Here's one way:
x <- mtcars
setDT(x)
x[, lapply(.SD, function(xx) {
  lapply(x, function(yy) cos_sim(xx, yy))
})]
The biggest difference between this and your original is really the use of lapply in place of the for loops. It's data.table-ish in that it makes use of .SD, but one can also just do the following in base R:
sapply(x, function(xx) {
  sapply(x, function(yy) cos_sim(xx, yy))
})
I think it's more svelte and preferable to nested for loops, but I'm not sure it's really taking special advantage of data.table per se.
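One detail matters when the function is not symmetric: in the nested apply form, the outer call builds the columns of the result, so the matrix comes out transposed relative to the question's loop. The sketch below illustrates this with a deliberately asymmetric toy function; `asym` is hypothetical and exists only for this example, and `vapply` is swapped in for `sapply` to pin down the result type.

```r
# Deliberately asymmetric toy function (hypothetical, for illustration only)
asym <- function(a, b) sum(a) / sum(b)

x <- mtcars[, 1:3]
p <- ncol(x)

# Question-style loop: y[i, j] = asym(column i, column j)
y <- matrix(NA_real_, nrow = p, ncol = p, dimnames = list(names(x), names(x)))
for (i in seq_len(p)) {
  for (j in seq_len(p)) {
    y[i, j] <- asym(x[[i]], x[[j]])
  }
}

# Nested apply: each outer iteration fills one column,
# so nested[j, i] = asym(column i, column j)
nested <- vapply(x, function(xx) {
  vapply(x, function(yy) asym(xx, yy), numeric(1))
}, numeric(p))

# The transpose of the nested result matches the loop
matches_loop <- isTRUE(all.equal(t(nested), y, check.attributes = FALSE))
```

For a symmetric function such as cosine similarity the two orientations coincide, which is why the difference does not show up in the mtcars example.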
Upvotes: 1