Reputation: 54
I am trying to use Milvus for image similarity search over 100 million+ embeddings. We have a separate column-oriented distributed database table outside of Milvus that annotates each image with hundreds of metadata fields such as latitude, longitude, number of pedestrians, etc.
When we search, we want to find the top K images that also satisfy a metadata filtering condition. So we first query the metadata table above to get all image IDs that satisfy the specific conditions (e.g., latitude > x and longitude < y and # of pedestrians > 0). We then want to find the top K results using the text query embedding plus a filtering condition that may involve a very large number of image IDs, like "image_id == id1 OR image_id == id2 OR ... OR image_id == id1000000".
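For concreteness, here is a minimal sketch of how such an ID filter could be built as a string. It uses the `in` operator from Milvus boolean expressions instead of a long chain of `OR` clauses; the helper name `build_id_filter` is hypothetical, and it assumes `image_id` is an integer scalar field. This only makes the expression more compact and is not claimed to fix the latency by itself.

```python
from typing import Iterable


def build_id_filter(image_ids: Iterable[int], field: str = "image_id") -> str:
    """Build a filter string such as 'image_id in [1, 2, 3]'.

    Hypothetical helper: assumes `field` is an integer scalar field.
    """
    # Deduplicate and sort; int() rejects anything that is not a plain integer.
    ids = sorted({int(i) for i in image_ids})
    if not ids:
        raise ValueError("empty id set: no image can match the filter")
    return f"{field} in [{', '.join(map(str, ids))}]"


expr = build_id_filter([42, 7, 7, 1001])
print(expr)  # image_id in [7, 42, 1001]
```

The resulting string would be passed as the filter expression of the search call in place of the `OR` chain.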
So far the latency of this top K + ID filtering query against Milvus is high, and we are trying to optimize it.
One option is to also store all the metadata inside Milvus so we don't have to filter with a huge chain of ORs over the selected IDs. However, there are many metadata fields and they are already used by other use cases, so at this stage we don't want to maintain a separate copy inside Milvus alongside the image embeddings.
Alternatively, inside Milvus, we could build an index on image_id. We could then partition the data by image_id into N partitions, split the filtering ID set at the application level into N batches (one per partition), issue N top K queries (one inside each partition), and finally merge the results into the final top K. This sounds like something that may help, but it requires non-trivial logic at the application level.
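The application-level logic described above could look roughly like the sketch below. Everything in it is an assumption for illustration: the routing rule `image_id % N`, the helper names, and the `search_partition` callback, which in practice would wrap a Milvus search restricted to one partition (for example through the `partition_names` argument of `collection.search`). It also assumes a metric where a smaller distance means a closer match.

```python
import heapq
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

Hit = Tuple[float, int]  # (distance, image_id); smaller distance = closer (assumed)


def partition_of(image_id: int, num_partitions: int) -> int:
    # Hypothetical routing rule; it must match how rows were assigned at insert time.
    return image_id % num_partitions


def split_ids(image_ids: Iterable[int], num_partitions: int) -> Dict[int, List[int]]:
    """Group the candidate IDs by the partition that holds them."""
    buckets: Dict[int, List[int]] = defaultdict(list)
    for image_id in image_ids:
        buckets[partition_of(image_id, num_partitions)].append(image_id)
    return dict(buckets)


def merged_top_k(
    image_ids: Iterable[int],
    num_partitions: int,
    k: int,
    search_partition: Callable[[int, List[int], int], List[Hit]],
) -> List[Hit]:
    """Run one top-K search per non-empty partition, then merge into a global top K."""
    per_partition = [
        search_partition(p, ids, k)
        for p, ids in split_ids(image_ids, num_partitions).items()
    ]
    return heapq.nsmallest(k, (hit for hits in per_partition for hit in hits))


def fake_search(partition: int, ids: List[int], k: int) -> List[Hit]:
    # Stand-in for a real per-partition search: pretends distance = id / 10.
    return sorted((i / 10.0, i) for i in ids)[:k]


print(merged_top_k([5, 12, 3, 8, 20, 1], num_partitions=3, k=3,
                   search_partition=fake_search))
# [(0.1, 1), (0.3, 3), (0.5, 5)]
```

Each partition has to return its own top K (not K/N) for the merged result to be correct, and the per-partition searches are independent, so they could be issued concurrently.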
We would like to know whether anyone has dealt with a similar case, and what some good ways are to speed up the query in such a scenario. Thanks in advance for your time/help!
Upvotes: 0
Views: 85
Reputation: 193
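If you are looking to filter the topK embeddings by metadata, here is what I would suggest doing: pass the condition, e.g. latitude > x and longitude < y and pedestrians > 0, as the filter expression of the search call (this requires those fields to be stored in the Milvus collection):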
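    results = collection.search(
        data=[text_query_embedding],  # list of query vectors; a single embedding here
        limit=topk,                   # number of hits to return per query vector
        expr="latitude > x and longitude < y and pedestrians > 0",  # replace x and y with literal values
        anns_field="xxxx",            # name of the vector field to search
        param={},                     # index-specific search parameters
    )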
The results will contain topK items with ID-distance pairs.
Upvotes: 1