Reputation: 11
I'm observing substantial differences in query performance while executing vector similarity search queries in Cassandra. Here's the context and details:
CREATE TABLE cycling.feature (
    mall_id bigint,
    place_id bigint,
    hardware_id bigint,
    feature_desc_id bigint,
    occur_at timestamp,
    vc vector<float,256>,
    PRIMARY KEY ((mall_id), place_id, hardware_id, occur_at, feature_desc_id)
) WITH CLUSTERING ORDER BY (place_id ASC, hardware_id ASC, occur_at DESC, feature_desc_id DESC);
CREATE INDEX IF NOT EXISTS feature_ann_index_cos
ON cycling.feature(vc) USING 'sai'
WITH OPTIONS = { 'similarity_function': 'cosine' };
With mall_id Filter:
SELECT similarity_cosine(vc, ?) AS sim
FROM cycling.feature
WHERE mall_id = ?
ORDER BY vc ANN OF ? LIMIT 1;
Without mall_id Filter:
SELECT similarity_cosine(vc, ?) AS sim
FROM cycling.feature
ORDER BY vc ANN OF ? LIMIT 1;
The query with the mall_id filter is significantly slower than the one without it, even though both perform a vector similarity search.
I expected the query with the mall_id filter to be faster than the one without, since it is restricted to a single partition.
Upvotes: 1
Views: 105
Reputation: 4031
As your data grows, I would also expect the query with mall_id to end up faster (or at least to scale better), especially if your cluster has more nodes than the replication factor, because specifying mall_id targets the query at a single node (a replica that owns that partition). Until your data set is fairly large, though, having mall_id in the query will actually slow it down, because more comparisons are needed to find the answer. You can turn on query tracing in cqlsh (TRACING ON) to see that more work has to be done filtering out responses when you add extra restrictions to the query, compared with just returning the output of the ANN search.
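The tracing comparison can be tried in cqlsh roughly as follows. This is only a sketch: the mall_id value 1 and the [ ... ] vector are placeholders (assumptions, not values from the question), and you need to substitute a real mall_id and a full 256-element float literal before running it.

```cql
-- TRACING is a cqlsh shell command, not a CQL statement.
TRACING ON;

-- Placeholder values: replace 1 with a real mall_id and [ ... ] with a
-- complete 256-element vector literal (the same one in both positions).
SELECT similarity_cosine(vc, [ ... ]) AS sim
FROM cycling.feature
WHERE mall_id = 1
ORDER BY vc ANN OF [ ... ] LIMIT 1;

-- Same search without the partition restriction, for comparison.
SELECT similarity_cosine(vc, [ ... ]) AS sim
FROM cycling.feature
ORDER BY vc ANN OF [ ... ] LIMIT 1;

TRACING OFF;
```

After each query, cqlsh prints a trace session with per-step activity and elapsed times, so the two traces can be compared side by side to see where the filtered query spends its extra time.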
Upvotes: 1