user2806363
user2806363

Reputation: 2593

How to compute first eigen value very fast for large matrices?

I have a very large matrix (20000 * 20000) and I would like to compute its largest eigenvalue. When I do it in Matlab, it takes couple of seconds, but in R, it takes over an hour to compute. Currently I'm using rARPACK and it takes hours to compute.

library("rARPACK")
eigs_sym(cov(TS), k = 1, which = "LM", opts = list(retvec = FALSE))

Any alternative or solutions ?

Upvotes: 1

Views: 1370

Answers (3)

Matifou
Matifou

Reputation: 8880

How are you creating your large matrix M? If it is produced as the product of two matrices say M = XX', remember that the first eigenvalues of XX' are equal to the first eigenvalues of X'X. This means that if X has many less rows than columns (or vice-versa) you could compute values from a much smaller matrix!

Code assuming M= XX' and X has 200 rows and 20 columns. Computing the first five eigenvalues from XX' (200 by 200) is around 30 times slower than computing from X'X (20 by 20):

N <- 200
p <- 20
X <- matrix(rnorm(N*p), N, p)
XXt <- tcrossprod(X)
XtX <- crossprod(X)

eigen(XXt)$values[1:5]
#> [1] 313.1307 288.9510 261.9685 260.7008 236.8953
eigen(XtX)$values[1:5]
#> [1] 313.1307 288.9510 261.9685 260.7008 236.8953

microbenchmark::microbenchmark(large_M = eigen(XXt)$values[1:5],
                               small_M = eigen(XtX)$values[1:5],
                               times =50,
                               check="equal")
#> Unit: microseconds
#>     expr      min       lq       mean   median        uq       max neval cld
#>  large_M 9178.500 9434.875 10759.3923 10354.18 11416.863 15569.834    50   b
#>  small_M  233.601  247.493   325.8577   312.32   373.146   562.687    50  a

Created on 2021-08-27 by the reprex package (v2.0.1)

Upvotes: 1

Matifou
Matifou

Reputation: 8880

I was also surprised that RSpectra::eigs_sym was not soooo much faster than base::eigen. I did a benchmark comparing both over various sample sizes and number of eigenvalues.

In general, RSpectra::eigs_sym is 3-5 times faster than base::eigen. But given that base::eigen will take an eternity for a 20000 * 20000, RSpectra::eigs_sym still should take a good quarter of eternity!

enter image description here

library(RSpectra)
library(tidyverse)
library(microbenchmark)
library(clusterGeneration)
library(tictoc)

## generate S
S_dat <- tibble(size = c(50, 100, 500, 1000, 2000, 5000)) %>% 
  mutate(S=map(size, ~genPositiveDefMat(.)$Sigma))


### benvhmark function
bench_oneS <- function(S, n_eigens=5){
  dim <- nrow(S)
  n_rep <- case_when(dim<=1000 ~20,
                     TRUE~10)
  microbenchmark("base::eigen" = eigen(S)$values[1:n_eigens],
                 "RSpectra::eigs_sym" =eigs_sym(S,n_eigens, opts = list(retvec = FALSE))$values,
                 times=n_rep) %>% 
    summary(a, unit="ms") %>% 
    as_tibble() %>% 
    mutate(expr = as.character(expr))
}

## do benchmark
tic()
bench_out <- S_dat %>% 
  mutate(bench_5=map(S, ~bench_oneS(.)),
         bench_1=map(S, ~bench_oneS(., n_eigens=1))) %>% 
  dplyr::select(-S)%>% 
  pivot_longer(starts_with("bench"), names_to="n_eigens", values_to="bench",
               names_prefix="bench_", names_transform=list(n_eigens=as.integer)) %>% 
  unnest(bench)
toc()

##
pl_ratio <- bench_out %>% 
  dplyr::select(size, n_eigens, expr, mean) %>% 
  spread(expr, mean) %>% 
  mutate(ratio=`base::eigen` /`RSpectra::eigs_sym`) %>% 
  ggplot(aes(x=size, y= ratio, linetype=factor(n_eigens))) +
  geom_line()+
  geom_point()+
  labs(color="Method:", linetype="N eigens")+
  theme(legend.position="bottom")+
  ylab("Ratio time (eigen/eigs_sym)")+
  xlab("Sample size")+
  ggtitle("Ratio of average timing, base::eigen vs RSpectra::eigs_sym")

Upvotes: 1

David
David

Reputation: 356

rARPACK used to be a wrapper around the library ARPACK but is now a wrapper around Spectra an improved and reimplemented version of the same algorithms, which outperforms ARPACK in many tests.

As the matlab eigs function is also a wrapper around the ARPACK package it seems unlikely that the issue is the solver if you have the same arguments in both cases.

Is it possible that the step taking the time is the calculation of the covariance matrix (i.e. cov(TS))?

Upvotes: 2

Related Questions