Reputation: 3689
I have two sparse matrices A and B (slam::simple_triplet_matrix) of the same MxN dimensions, where M = ~100K, N = ~150K.
I wish to calculate the cosine distance between each pair of rows (i.e. row 1 from matrix A and row 1 from matrix B, row 2 from matrix A and row 2 from matrix B, etc.).
I can do this using a for-loop or the apply function, but that's too slow. Something like:
library(slam)
A <- simple_triplet_matrix(1:3, 1:3, 1:3)
B <- simple_triplet_matrix(1:3, 3:1, 1:3)
cosine <- NULL
for (i in 1:(dim(A)[1])) {
  a <- as.vector(A[i, ])
  b <- as.vector(B[i, ])
  cosine[i] <- a %*% b / sqrt(a %*% a * b %*% b)
}
I understand something in this previously asked question might help me, but:
a) This isn't really what I want: I just want M cosine distances for M rows, not all pairwise correlations between rows of a given sparse matrix.
b) I admit to not fully understanding the math behind this 'vectorized' solution so maybe some explanation would come in handy.
Thank you.
EDIT: This is also NOT a duplicate of this question as I'm not just interested in a regular cosine computation on two simple vectors (I clearly know how to do this, see above), I'm interested in a much larger scale situation, specifically involving slam sparse matrices.
Upvotes: 1
Views: 2250
Reputation: 7435
According to the documentation, element-by-element (array) multiplication of compatible simple_triplet_matrices and row_sums of simple_triplet_matrices are available. With these operators/functions, the computation is:
cosineDist <- function(A, B){
  row_sums(A * B) / sqrt(row_sums(A * A) * row_sums(B * B))
}
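As a quick sanity check, here is a minimal sketch that runs the function on the 3 x 3 example matrices from the question and compares it with the same formula computed on ordinary dense matrices. The names sparse_result, dense_result, Ad and Bd are made up for this illustration, and the dense comparison is only practical at this toy size, not at 100K x 150K.

```r
library(slam)

# Row-wise cosine between matching rows of two sparse matrices
# (same definition as above, repeated so the snippet runs on its own).
cosineDist <- function(A, B) {
  row_sums(A * B) / sqrt(row_sums(A * A) * row_sums(B * B))
}

# The 3 x 3 example matrices from the question.
A <- simple_triplet_matrix(1:3, 1:3, 1:3)
B <- simple_triplet_matrix(1:3, 3:1, 1:3)

# as.numeric() drops any names so the two results compare cleanly.
sparse_result <- as.numeric(cosineDist(A, B))

# Dense reference: the same formula with base R on ordinary matrices.
Ad <- as.matrix(A)
Bd <- as.matrix(B)
dense_result <- rowSums(Ad * Bd) / sqrt(rowSums(Ad^2) * rowSums(Bd^2))

all.equal(sparse_result, dense_result)
```

Working it by hand, rows 1 and 3 of A and B share no nonzero column, so their cosine is 0, while row 2 is identical in both matrices, so its cosine is 1. Note that the value returned is the cosine similarity; if you need a distance, 1 minus this value is the usual choice. Also, a row that is all zeros in either matrix gives 0/0, so expect NaN for such rows and handle them separately if they can occur in your data.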
Notes:
row_sums(A * B) computes the dot product of each row in A and its corresponding row in B, which is the numerator term in your cosine. The result is a vector (not sparse) whose elements are these dot products for each corresponding row in A and B.
row_sums(A * A) computes the squared 2-norm of each row in A. The result is a vector (not sparse) whose elements are these squared 2-norms for each row in A.
row_sums(B * B) computes the squared 2-norm of each row in B. The result is a vector (not sparse) whose elements are these squared 2-norms for each row in B.
Upvotes: 3
Reputation: 3696
cosineDist <- function(x){
  as.dist(1 - x %*% t(x) / (sqrt(rowSums(x^2) %*% t(rowSums(x^2)))))
}
Upvotes: 0