titi
titi

Reputation: 619

How to speed up R dist matrix for hierarchical clustering for large matrix input data?

I have a large matrix (approximately 35,000 x 35,000) and I'm preparing a distance object in R for hierarchical clustering. The base R function dist() is too slow, so I'm using the distances function from the distances package https://cran.r-project.org/web/packages/distances/distances.pdf. I have also implemented parallel processing to speed up the computation, but it still takes around 10 hours to run. Below is the code I am currently using. I utilize the final distance_matrix in the hclustgeo() function from the ClustGeo package.

Is there anyway that I speed this up, since I have another even bigger matrix (48,000 x 48,000) to run?

If I can not make this faster in R, should I switch to Python for better computing power?

Additionally, I have already tried the parDist() function from ParallelDist package, and it is not faster than distances(). Plus, there is a lot of my matrix cell values are 0. Is there a way not to calculate those pair to save some time?

###Calcuate the distance object
library(distances)
library(parallel)
library(ClustGeo)

###My data 'dat' is a 35k by 35k matrix

start <- Sys.time()
cl <- makeCluster(detectCores())
registerDoParallel(cl)

distance_matrix <- distance_matrix(distances(dat))

stopCluster(cl)

end <- Sys.time()
end - start  ###this runs about 10 hours for 35K by 35K matrix


####Hierarchical clustering using hclustgeo()
tree <- hclustgeo(distance_matrix)

Upvotes: 1

Views: 117

Answers (1)

jblood94
jblood94

Reputation: 17011

For large matrices of low density, a matrix multiplication version of the Euclidean distance calculation scales much better. A 35k x 35k (5% density) distance calculation took about 12 minutes on a laptop using the default BLAS.

library(Rfast) # for `Outer` (and `Dist` for speed comparisons)
library(Matrix) # for sparse matrix operations

bigSparseDist <- function(x) {
  rs2 <- rowSums(x^2)
  x <- tcrossprod(x, 2*x)
  sqrt(rep(rs2, each = length(rs2)) + rs2 - x)
}

Testing:

f <- function(n, d = 0.05) {
  # function to create a random sparse matrix of dimensions `n` x `n` and 
  # density `d`
  x <- sample(0:(n^2 - 1), 0.05*n^2)
  sparseMatrix(x%%n, x%/%n, dims = c(n, n), index1 = FALSE)
}

x <- f(100)
dx0 <- dist(x)
dx1 <- bigSparseDist(x)
dx2 <- Dist(as.matrix(x)) # Rfast::Dist
all.equal(c(as.matrix(dx0)), c(as.matrix(dx1)))
#> [1] TRUE
all.equal(as.matrix(dx1), dx2)
#> [1] TRUE

Compare timings with increasingly large sparse matrices.

x <- f(1e3)
system.time(dist(x))
#>    user  system elapsed 
#>    2.08    0.00    2.08
system.time(bigSparseDist(x))
#>    user  system elapsed 
#>    0.22    0.00    0.22
system.time(Dist(as.matrix(x)))
#>    user  system elapsed 
#>    0.23    0.00    0.23

x <- f(2e3)
system.time(dist(x))
#>    user  system elapsed 
#>   38.84    0.02   38.87
system.time(bigSparseDist(x))
#>    user  system elapsed 
#>    0.79    0.10    0.89
system.time(Dist(as.matrix(x)))
#>    user  system elapsed 
#>    2.17    0.01    2.19

x <- f(4e3)
# system.time(dist(x)) # takes too long
system.time(bigSparseDist(x))
#>    user  system elapsed 
#>    2.30    0.54    2.83
system.time(Dist(as.matrix(x)))
#>    user  system elapsed 
#>   18.53    0.06   18.61

And, finally:

x <- f(35e3)
system.time(bigSparseDist(x))
#> Warning in asMethod(object): sparse->dense coercion: allocating vector of size
#> 9.1 GiB
#>    user  system elapsed 
#>  444.11  174.92  716.36

Most of the time is spent in tcrossprod (and most of the rest in sqrt), which is highly optimized. The times above are for the default BLAS. Not sure if something like OpenBLAS would be faster.


MATLAB

FWIW, the same approach is about twice as fast in MATLAB for the 35k x 35k matrix, and it exhibits a similar divergence in timing against the built-in function pdist as the matrix grows.

Functions:

function x = randSparse(m, n, d)
    % function to create a random sparse matrix of dimensions `m` x `n` and
    % density `d`
    x = randperm(m*n, m*n*d);
    % x = randi([0, m*n], d*m*n, 1);
    x = sparse(mod(x, m) + 1, floor(x/m) + 1, rand(length(x), 1), m, n);
end

function x = bigSparseDist(x)
    rs2 = sum(x^2, 2);
    x = x*2*x';
    x = sqrt(rs2 + rs2' - x);
end

Timings:

>> x = randSparse(1e3, 1e3, 0.05);
>> tic; dx = pdist(x); toc
Elapsed time is 0.037151 seconds.
>> tic; dx = bigSparseDist(x); toc
Elapsed time is 0.161116 seconds.
>> x = randSparse(2e3, 2e3, 0.05);
>> tic; dx = pdist(x); toc
Elapsed time is 0.220821 seconds.
>> tic; dx = bigSparseDist(x); toc
Elapsed time is 0.118197 seconds.
>> tic; dx0 = pdist(x); toc
Elapsed time is 0.221399 seconds.
>> tic; dx1 = bigSparseDist(x); toc
Elapsed time is 0.110194 seconds.
>> x = randSparse(4e3, 4e3, 0.05);
>> tic; dx0 = pdist(x); toc
Elapsed time is 1.710372 seconds.
>> tic; dx1 = bigSparseDist(x); toc
Elapsed time is 0.469562 seconds.
>> x = randSparse(8e3, 8e3, 0.05);
>> tic; dx0 = pdist(x); toc
Elapsed time is 13.772585 seconds.
>> tic; dx1 = bigSparseDist(x); toc
Elapsed time is 2.655722 seconds.
>> x = randSparse(35e3, 35e3, 0.05);
>> tic; dx1 = bigSparseDist(x); toc
Elapsed time is 367.179998 seconds.

Upvotes: 1

Related Questions