beginner_
beginner_

Reputation: 7632

Updatable nearest neighbor search

I'm trying to come up with a good design for a nearest neighbor search application. This would be somewhat similar to this question:

Saving and incrementally updating nearest-neighbor model in R

In my case this would be in Python but the main point being the part that when new data comes, the model / index must be updated. I'm currently playing around with scikit-learn neighbors module but I'm not convinced it's a good fit.

The goal of the application:

User comes in with a query and then the n (probably will be fixed to 5) nearest neighbors in the existing data set will be shown. For this step such a search structure from sklearn would help but that would have to be regenerated when adding new records.Also this is a first ste that happens 1 per query and hence could be somewhat "slow" as in 2-3 seconds compared to "instantly".

Then the user can click on one of the records and see that records nearest neighbors and so forth. This means we are now within the exiting dataset and the NNs could be precomputed and stored in redis (for now 200k records but can be expanded to 10th or 100th of millions). This should be very fast to browse around.

But here I would face the same problem of how to update the precomputed data without having to do a full recomputation of the distance matrix especially since there will be very few new records (like 100 per week).

Does such a tool, method or algorithm exist for updatable NN searching?

EDIT April, 3rd:

As is indicated in many places KDTree or BallTree isn't really suited for high-dimensional data. I've realized that for a Proof-of-concept with a small data set of 200k records and 512 dimensions, brute force isn't much slower at all, roughly 550ms vs 750ms.

However for large data set in millions+, the question remains unsolved. I've looked at datasketch LSH Forest but it seems in my case this simply is not accurate enough or I'm using it wrong. Will ask a separate question regarding this.

Upvotes: 3

Views: 1795

Answers (3)

Eric L
Eric L

Reputation: 742

nmslib supports adding new vectors. It's used by OpenSearch as part their Similarity Search Engine, and it's very fast.

One caveat:

While the HNSW algorithm allows incremental addition of points, it forbids deletion and modification of indexed points.

You can also look into solutions like Milvus or Vearch.

Upvotes: 1

Lea
Lea

Reputation: 181

You could try out Milvus that supports adding and near real-time search of vectors.

Here are the benchmarks of Milvus.

Upvotes: 3

jokol
jokol

Reputation: 383

You should look into FAISS and its IVFPQ method What you can do there is create multiple indexes for every update and merge them with the old one

Upvotes: 3

Related Questions