Reputation: 2188

LSH Libraries in Java

I'm looking for a lightweight Java library that supports Nearest Neighbor Searches by Locality Sensitive Hashing for nearly equally distributed data in a high dimensional (in my case 32) dataset with some hundreds of thousands data points.

It's totally good enough to get all entries in a bucket for a query. Which ones i really need could then be processed in a different way under consideration of some filter parameters my problem include.

I already found likelike but hope that there is something a bit smaller and without need of any other tools (like Apache Hadoop in the case of likelike).

Upvotes: 22

Answers (5)

Raman

Reputation: 19585

Apache Spark has an LSH implementation: https://spark.apache.org/docs/2.1.0/ml-features.html#locality-sensitive-hashing (API).

After having played with both the tdebatty and TarsosLSH implementations, I'll likely use Spark, as it supports sparse vectors as input. The tdebatty requires a non-sparse array of booleans or int's, and the TarsosLSH Vector implementation is a non-sparse array of doubles. This severely limits the number of dimensions one can reasonably support.

This page provides links to more projects, as well as related papers and information: https://janzhou.org/lsh/.

Upvotes: 2

nilsi

Reputation: 10761

Maybe this one:

"TarsosLSH is a Java library implementing Locality-sensitive Hashing (LSH), a practical nearest neighbour search algorithm for multidimensional vectors that operates in sublinear time. It supports several Locality Sensitive Hashing (LSH) families: the Euclidean hash family (L2), city block hash family (L1) and cosine hash family. The library tries to hit the sweet spot between being capable enough to get real tasks done, and compact enough to serve as a demonstration on how LSH works."

Code can be found here

Upvotes: 5

Has QUIT--Anony-Mousse

Reputation: 77454

The ELKI data mining framework comes with an LSH index. It can be used with most algorithms included (anything that uses range or nn searches) and sometimes works very well.

In other cases, LSH doesn't seem to be a good approach. It can be quite tricky to get the LSH parameters right: if you choose some parameters too high, runtime grows a lot (all the way to a linear scan). If you choose them too low, the index becomes too approximative and loses to many neighbors.

It's probably the biggest challenge with LSH: finding good parameters, that yield the desired speedup and getting a good enough accuracy out of the index...

Upvotes: 1