Reputation: 7070
I'm building an application that stores millions of floating-point vectors, each with ~100 dimensions. Given a query vector, I need to search these vectors for the k nearest matches (by Euclidean distance), and the query must run faster than scanning all millions of vectors. By "vector" I mean the linear-algebra sense: a list of ~100 floating-point numbers, e.g. [0.3, -15.7, 0.004, 457.1, ...].
I know databases like MySQL and MongoDB provide spatial indexes that work for two dimensions. Is there a way to adapt these to many more dimensions, perhaps with composite indexes? Or are there other data stores that support indexes on more dimensions?
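To make the requirement concrete, this is the full scan I need to beat (a minimal NumPy sketch with random stand-in data in place of my real vectors):

```python
import numpy as np

# Toy stand-in for the real data: 1M vectors, 100 dimensions.
data = np.random.rand(1_000_000, 100).astype(np.float32)

def knn_brute_force(query, data, k):
    """Full scan: Euclidean distance to every vector, keep the k smallest."""
    dists = np.linalg.norm(data - query, axis=1)
    idx = np.argpartition(dists, k)[:k]   # k nearest, in arbitrary order
    return idx[np.argsort(dists[idx])]    # sort those k by distance

query = np.random.rand(100).astype(np.float32)
nearest = knn_brute_force(query, data, k=10)
```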
Upvotes: 7
Views: 1176
Reputation: 411
PostgreSQL, via its cube extension. It supports as many dimensions as you want; the default maximum is 100, which can be raised by editing CUBE_MAX_DIM in cubedata.h before building the extension. The cube type supports GiST indexes (an R-tree-like structure) for fast queries.
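A minimal sketch of what that looks like, assuming the psycopg2 driver and PostgreSQL 9.6+ (where cube's `<->` Euclidean-distance operator can use the GiST index for ORDER BY ... LIMIT queries); the connection string, table, and column names here are made up for illustration:

```python
import psycopg2  # assumed driver; any PostgreSQL client would do

conn = psycopg2.connect("dbname=vectors")  # hypothetical connection string
cur = conn.cursor()

# One-time setup: the cube extension, a table, and a GiST (R-tree-style) index.
cur.execute("CREATE EXTENSION IF NOT EXISTS cube")
cur.execute("CREATE TABLE IF NOT EXISTS items (id serial PRIMARY KEY, vec cube)")
cur.execute("CREATE INDEX IF NOT EXISTS items_vec_idx ON items USING gist (vec)")

# Store a vector as a zero-volume cube (a point).
vec = [0.3, -15.7, 0.004]  # ... up to 100 dimensions
cur.execute("INSERT INTO items (vec) VALUES (cube(%s::float8[]))", (vec,))

# k-NN query: <-> is cube's Euclidean distance operator, and with a GiST
# index this ORDER BY ... LIMIT becomes an index-assisted nearest-neighbour
# scan rather than a full sort.
cur.execute(
    "SELECT id FROM items ORDER BY vec <-> cube(%s::float8[]) LIMIT %s",
    (vec, 10),
)
print(cur.fetchall())
conn.commit()
```

It's worth testing at your real dimensionality, since very wide cube values can bump into GiST index row-size limits.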
Upvotes: 1
Reputation: 1534
I can relate to your pain. There is no R-tree type of implementation in MongoDB, and I'm not sure whether there is one in the SQL databases. I found the following link useful:
http://www.slideshare.net/nknize/mongo-sv-knizefinal
Upvotes: 0
Reputation: 19621
If you are looking for exact matches, 100 dimensions is a lot. If you are prepared to settle for approximate matches, there is a class of locality-sensitive hashing (LSH) schemes. You could generate a hash value (or a series of hash values) for your data set and use an ordinary database, or a 2-D spatial database, to look for matches based on the hash value. One reference is http://people.csail.mit.edu/indyk/p117-andoni.pdf.
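For illustration, here is a self-contained sketch of one such scheme, p-stable LSH for Euclidean distance (the E2LSH-style family surveyed in that paper); the parameters n_tables, n_hashes, and w are placeholders that would need tuning on real data:

```python
import numpy as np
from collections import defaultdict

class EuclideanLSH:
    """p-stable LSH for Euclidean distance: h(v) = floor((a.v + b) / w),
    with a drawn from N(0, 1)^d and b from U[0, w). Vectors landing in the
    same bucket of any table are candidate neighbours."""

    def __init__(self, dim, n_tables=8, n_hashes=4, w=4.0, seed=0):
        rng = np.random.default_rng(seed)
        self.w = w
        self.a = rng.normal(size=(n_tables, n_hashes, dim))
        self.b = rng.uniform(0, w, size=(n_tables, n_hashes))
        self.tables = [defaultdict(list) for _ in range(n_tables)]

    def _keys(self, v):
        # One integer-tuple bucket key per table.
        h = np.floor((self.a @ v + self.b) / self.w).astype(int)
        return [tuple(row) for row in h]

    def index(self, vectors):
        self.vectors = np.asarray(vectors)
        for i, v in enumerate(self.vectors):
            for table, key in zip(self.tables, self._keys(v)):
                table[key].append(i)

    def query(self, q, k):
        # Gather candidates from matching buckets, then rank them exactly.
        cand = {i for table, key in zip(self.tables, self._keys(q))
                for i in table.get(key, [])}
        cand = np.fromiter(cand, dtype=int)
        if cand.size == 0:
            return cand
        d = np.linalg.norm(self.vectors[cand] - q, axis=1)
        return cand[np.argsort(d)[:k]]

# Usage on random stand-in data:
data = np.random.rand(100_000, 100)
lsh = EuclideanLSH(dim=100)
lsh.index(data)
print(lsh.query(data[0], k=10))
```

The exact distance computation runs only over the candidates pulled from the buckets, which is where the speedup over a full scan comes from; the trade-off is that true neighbours hashing to different buckets are missed, so results are approximate.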
Upvotes: 3