Bakaburg
Bakaburg

Reputation: 3321

Fast computation of > 10^6 cosine vector similarities in R

I got a document term matrix of ~1600 documents x ~120 words. I would like to compute the cosine similarity between all these vectors, but we are speaking about ~1,300,000 comparisons [n * (n - 1) / 2].

I used parallel::mclapply with 8 but it still takes forever.

Which other solution do you suggest?

Thanks

Upvotes: 0

Views: 1905

Answers (3)

mlinegar
mlinegar

Reputation: 1399

The coop package's coop::cosine function is probably the best way to do this now. It is implemented in Rcpp, but also has a different approach than lsa::cosine, and also has lower memory overhead. Its use is exactly the same as lsa::cosine, just switch out the package names.

For further speedups, you may want to change your BLAS library. The coop manual has a few basic details and suggestions.

Upvotes: 2

ekstroem
ekstroem

Reputation: 6191

Here's my take on it.

If I define cosine similarity as

coss <- function(x) {crossprod(x)/(sqrt(tcrossprod(colSums(x^2))))}

(I think that is about as quickly as I can make it with base R functions and the often overseen crossprod which is a little gem). If I compare it with an RCpp function using RCppArmadillo (slightly updated as suggested by @f-privé)

NumericMatrix cosine_similarity(NumericMatrix x) {
  arma::mat X(x.begin(), x.nrow(), x.ncol(), false);

  // Compute the crossprod                                                                                      
  arma::mat res = X.t() * X;
  int n = x.ncol();
  arma::vec diag(n);
  int i, j;

  for (i=0; i<n; i++) {
    diag(i) = sqrt(res(i,i));
  }

  for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
      res(i, j) /= diag(i)*diag(j);

  return(wrap(res));
}

(this might possibly be optimised with some of the specialized functions in the armadillo library - just wanted to get some timing measurements).

Comparing those yields

> XX <- matrix(rnorm(120*1600), ncol=1600)
> microbenchmark::microbenchmark(cosine_similarity(XX), coss(XX), coss2(XX), times=50)
> microbenchmark::microbenchmark(coss(x), coss2(x), cosine_similarity(x), cosine_similarity2(x), coss3(x), times=50)
Unit: milliseconds
                  expr      min       lq     mean   median       uq      max
               coss(x) 173.0975 183.0606 192.8333 187.6082 193.2885 331.9206
              coss2(x) 162.4193 171.3178 183.7533 178.8296 184.9762 319.7934
 cosine_similarity2(x) 169.6075 175.5601 191.4402 181.3405 186.4769 319.8792
 neval cld
    50  a 
    50  b 
    50  a 

which is really not that bad. The gain in computing the cosine similarity using C++ is super small (with @ f-privé's solution being fastest) so I'm guessing your timing issues are due to what you are doing to convert the text from the words to numbers and not when calculating the cosine similarity. Without knowing more about your specific code it is hard for us to help you.

Upvotes: 6

F. Priv&#233;
F. Priv&#233;

Reputation: 11728

I very agree with @ekstroem on the use of crossprod but I think there are unnecessary computations in his implementation. I think by the way that coss is giving a wrong result. Comparing his answer with mine you can use this cpp file:

// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericMatrix cosine_similarity(NumericMatrix x) {                                                             

  arma::mat X(x.begin(), x.nrow(), x.ncol(), false);
  arma::mat rowSums = sum(X % X, 0);
  arma::mat res;

  res = X.t() * X / sqrt(rowSums.t() * rowSums);

  return(wrap(res));
}

// [[Rcpp::export]]
NumericMatrix& toCosine(NumericMatrix& mat,
                        const NumericVector& diag) {

  int n = mat.nrow();
  int i, j;

  for (j = 0; j < n; j++) 
    for (i = 0; i < n; i++) 
      mat(i, j) /= diag(i) * diag(j);

  return mat;
}

/*** R
coss <- function(x) { 
  crossprod(x)/(sqrt(crossprod(x^2)))
}

coss2 <- function(x) {
  cross <- crossprod(x)
  toCosine(cross, sqrt(diag(cross)))
}

XX <- matrix(rnorm(120*1600), ncol=1600)

microbenchmark::microbenchmark(
  cosine_similarity(XX), 
  coss(XX), 
  coss2(XX),
  times = 20
)
*/

Unit: milliseconds
                  expr      min       lq     mean   median       uq      max neval
 cosine_similarity(XX) 172.1943 176.4804 181.6294 181.6345 185.7542 199.0042    20
              coss(XX) 262.6167 270.9357 278.8999 274.4312 276.1176 337.0531    20
             coss2(XX) 134.6742 137.6013 147.3153 140.4783 146.5806 204.2115    20

So, I will definility go for computing the crossprod in base R and then do the scaling in Rcpp.

PS: If you have a very sparse matrix, you could use package Matrix to convert your matrix to a sparse matrix. This new class of matrix also have the crossprod method so you could use coss2 as well.

Upvotes: 2

Related Questions