How to calculate a lot of records in DB with reasonable time

Question

Suppose I have a vector in my application, for example (5, 4, 6, 8), and I want to find how similar it is to the other vectors in my DB. For simplicity, let's say I measure the distance between two vectors with the Manhattan distance.

What I need is a way to compute that measure (Manhattan distance in my example) between my vector and all the vectors stored in my DB. Can I do this for 10 million vectors in under a couple of seconds?
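For reference, the exact (brute-force) version of this computation can be sketched as below. This is only a baseline, assuming the vectors have already been loaded from the DB into a NumPy array; the function name `manhattan_top_k` and the random stand-in data are hypothetical. Whether it fits a budget of a couple of seconds for 10 million rows depends on the vector dimension and the hardware, so it should be measured rather than assumed.

```python
import numpy as np

def manhattan_top_k(query, vectors, k=5):
    """Return (indices, distances) of the k rows of `vectors` closest to `query` in L1 distance."""
    query = np.asarray(query, dtype=np.float32)
    vectors = np.asarray(vectors, dtype=np.float32)
    # Manhattan (L1) distance from the query to every row, in one vectorized pass.
    dists = np.abs(vectors - query).sum(axis=1)
    k = min(k, len(dists))
    # argpartition selects the k smallest distances without sorting the whole array.
    idx = np.argpartition(dists, k - 1)[:k]
    # Sort only those k candidates so the result is ordered nearest first.
    idx = idx[np.argsort(dists[idx])]
    return idx, dists[idx]

# Stand-in data (assumption): 100,000 random 4-dimensional integer vectors.
rng = np.random.default_rng(0)
db_vectors = rng.integers(0, 10, size=(100_000, 4))
nearest_idx, nearest_dist = manhattan_top_k([5, 4, 6, 8], db_vectors, k=3)
```

The scan is linear in the number of stored vectors, which is what makes approximate indexes attractive once the collection grows large.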

Bartłomiej Twardowski · Accepted Answer

If you really are dealing with a lot of data, what you need is an approximate nearest neighbor (ANN) search - http://en.wikipedia.org/wiki/Nearest_neighbor_search#Approximate_nearest_neighbor implementation. Take a look at the Annoy project page - https://pypi.python.org/pypi/annoy/1.8.0. It includes a benchmark against other ANN projects, which you may find interesting. There may be an implementation available as a DB plugin, but I am not aware of one. However, ANN can also be used to pre-compute the top-n nearest neighbors and store them in the DB as a list for each user/item.
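The pre-computation idea can be sketched roughly as follows. To keep the example self-contained it uses an exact brute-force search as a stand-in for the ANN step, and SQLite from the standard library as the DB; the table name `neighbors`, the JSON-encoded list, and the helper names are all assumptions for illustration. With a real data set, an ANN library such as Annoy would replace the inner search.

```python
import heapq
import json
import sqlite3

def manhattan(a, b):
    """Manhattan (L1) distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def precompute_neighbors(conn, items, n=2):
    """Store the ids of the n nearest neighbors of every item as a JSON list.

    `items` maps item_id -> vector. The search here is exact brute force,
    used only as a placeholder for an ANN index.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS neighbors ("
        "item_id INTEGER PRIMARY KEY, neighbor_ids TEXT NOT NULL)"
    )
    for item_id, vec in items.items():
        candidates = (
            (manhattan(vec, other), other_id)
            for other_id, other in items.items()
            if other_id != item_id
        )
        # Keep the n smallest (distance, id) pairs, nearest first.
        top = [other_id for _, other_id in heapq.nsmallest(n, candidates)]
        conn.execute(
            "INSERT OR REPLACE INTO neighbors VALUES (?, ?)",
            (item_id, json.dumps(top)),
        )
    conn.commit()

def lookup_neighbors(conn, item_id):
    """Read the pre-computed neighbor list back; empty list if the item is unknown."""
    row = conn.execute(
        "SELECT neighbor_ids FROM neighbors WHERE item_id = ?", (item_id,)
    ).fetchone()
    return json.loads(row[0]) if row else []

conn = sqlite3.connect(":memory:")
items = {1: (5, 4, 6, 8), 2: (5, 4, 6, 9), 3: (0, 0, 0, 0), 4: (6, 5, 7, 9)}
precompute_neighbors(conn, items, n=2)
print(lookup_neighbors(conn, 1))
```

At query time the application then does a single keyed lookup instead of a distance computation, at the cost of re-running the pre-computation when the stored vectors change.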
