megmac
megmac

Reputation: 509

Linear sum assignment/Hungarian method performance in R

I need to speed up a process of finding most optimal distance for each entry. I am using gower.dist from StatMatch and solve_LSAP from the clue package. The gower distance takes no time at all, however the LSAP solver takes too long with the number of times I need to run it.

Is there a way to make this run faster using parallel computing or just making part of it run in parallel [link to clue github] [link to scientific journal discussing this] or another solver that I may be unaware of that is faster? The other two libraries I am aware of are adagio and RcppHungarian (both are slower).

Example data: Gower Distance Data (google drive link to folder with data)

> dim(gowerdist)
[1]  4309 10366
solve_LSAP(gowerdist, maximum = FALSE)

Upvotes: 1

Views: 498

Answers (2)

Martin Smith
Martin Smith

Reputation: 4077

clue::solve_LSAP uses the Hungarian algorithm. The Jonker and Volgenant (1987) algorithm, implemented in TreeDist::LAPJV(), is more efficient.

I tested a reduced version of the matrix to get results in seconds rather than minutes; differences in run time are likely to increase with larger matrices.

gower <- read.csv("GowerDistance.csv")
dim(gower)
gowerMat <- as.matrix(gower)
gow1 <- gowerMat[1:2400, 1:4800]
tictoc::tic()
clue <- clue::solve_LSAP(gow1, maximum = FALSE)
tictoc::toc()
# 67.95 sec elapsed
tictoc::tic()
td <- TreeDist::LAPJV(gow1)
tictoc::toc()
# 20.24 sec elapsed

GraphAlignment::LinearAssignment() on Bioconductor also uses the LAPJV algorithm, but can only be applied to square matrices. (A workaround is to add extra rows to the matrix with extremely high values.)

Another alternative is lpSolve; this does not specify which algorithm it uses, but is much slower

tictoc::tic()
lps <- lpSolve::lp.assign(gowerMat[1:800, 1:1600]) # much smaller matrix
tictoc::toc()
# 364.67 sec elapsed

Upvotes: 1

user10917479
user10917479

Reputation:

If you can transform your problem to run on integers instead of doubles, it should be much faster. Not sure how practical it is for all use cases, but that may be an option. Still not blazing fast, but definitely an improvement.

gowerdist_all <- readr::read_csv("GowerDistance.csv")

gowerdist1 <- as.matrix(gowerdist_all[,-1])

gowerdist2 <- as.matrix(gowerdist_all[,-1]) * 10000
mode(gowerdist2) <- "integer"

This takes 548 seconds.

tictoc::tic()
z <- clue::solve_LSAP(gowerdist1, maximum = FALSE)
tictoc::toc()
# 547.92 sec elapsed

Rounded as an integer takes 116 seconds.

tictoc::tic()
z <- clue::solve_LSAP(gowerdist2, maximum = FALSE)
tictoc::toc()
# 115.64 sec elapsed

Upvotes: 0

Related Questions