mrgloom

Reputation: 21632

Knn search for large data?

I'm interested in performing a knn search on a large dataset.

There are some libraries for this, such as ANN and FLANN, but my question is: how do you organize the search when the database does not fit entirely into memory (RAM)?

Upvotes: 2

Views: 7815

Answers (2)

Ando Saabas

Reputation: 1977

It depends on whether your data is very high-dimensional or not. If it is relatively low-dimensional, you can use an existing on-disk R-Tree implementation, such as Spatialite.

If it is higher-dimensional data, you can use X-Trees, but I don't know of any on-disk implementations off the top of my head.

Alternatively, you can implement locality-sensitive hashing with on-disk persistence, for example using mmap.
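A minimal sketch of this idea, using random-hyperplane LSH with the vectors kept on disk via NumPy's memmap (a stand-in for raw mmap); the hash table of bucket → row indices stays in RAM, and only candidate vectors are read from disk at query time. All names, sizes, and the brute-force fallback are illustrative, not a production design:

```python
import os
import tempfile
from collections import defaultdict

import numpy as np

dim, n_bits, n = 8, 16, 1000
rng = np.random.default_rng(0)

# Random hyperplanes: the sign of the projection onto each one gives one hash bit.
planes = rng.standard_normal((n_bits, dim)).astype(np.float32)

def lsh_key(v):
    bits = planes @ v > 0
    return int(bits.dot(1 << np.arange(n_bits)))  # pack bits into an integer key

# Persist the vectors on disk; memmap lets us index them without loading everything.
data = rng.standard_normal((n, dim)).astype(np.float32)
path = os.path.join(tempfile.mkdtemp(), "vectors.dat")
mm = np.memmap(path, dtype=np.float32, mode="w+", shape=(n, dim))
mm[:] = data
mm.flush()

# In-RAM hash table: bucket key -> list of row indices in the on-disk file.
buckets = defaultdict(list)
for i in range(n):
    buckets[lsh_key(mm[i])].append(i)

def query(q, k=5):
    # Only the rows in the matching bucket are paged in from disk.
    cand = buckets.get(lsh_key(q), [])
    if not cand:
        cand = range(n)  # naive fallback when the bucket is empty
    cand = np.asarray(list(cand))
    d = np.linalg.norm(mm[cand] - q, axis=1)
    return cand[np.argsort(d)[:k]]
```

In practice you would use several hash tables to raise recall, but the structure is the same: cheap in-RAM hashing to shortlist candidates, disk reads only for the shortlist.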

Upvotes: 4

JonasVautherin

Reputation: 8033

I suppose it depends on how much bigger your index is than the available memory. Here are my first spontaneous ideas:

  1. Supposing it were tens of times the size of the RAM, I would try to cluster my data using, for instance, hierarchical clustering trees (as implemented in FLANN). I would modify the tree implementation so that it keeps the branches in memory and stores the leaves (the clusters) on disk. The appropriate cluster would then have to be loaded for each query. You could then try to optimize this in different ways.

  2. If it was not that much bigger (say, twice the size of the RAM), I would split the dataset into two parts and create one index for each. I would then find the nearest neighbor in each part and choose between the two results.
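The second idea can be sketched as follows, using a brute-force search as a stand-in for a real per-partition index (e.g. one built with FLANN); the point is only the merge step: query each half independently, keep per-partition top-k candidates with their distances, and take the global top-k from the merged list. Names and sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.standard_normal((1000, 16)).astype(np.float32)

# Two partitions; in the real scenario only one index fits in RAM at a time.
half_a, half_b = data[:500], data[500:]

def knn_partition(part, offset, q, k):
    # Brute-force stand-in for an index built on this partition.
    # `offset` maps local row numbers back to global dataset indices.
    d = np.linalg.norm(part - q, axis=1)
    idx = np.argsort(d)[:k]
    return [(float(d[i]), offset + int(i)) for i in idx]

def knn_two_parts(q, k=5):
    # Query each partition, then merge the candidate lists by distance
    # and keep the global top-k.
    merged = knn_partition(half_a, 0, q, k) + knn_partition(half_b, 500, q, k)
    merged.sort()
    return [i for _, i in merged[:k]]
```

Since each partition returns its own top-k, the merged list is guaranteed to contain the true global top-k, so the split costs no accuracy, only a second index lookup.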

Upvotes: 4

Related Questions