Reputation: 2343
I have an app where users can sign up and fill out a profile. This profile consists of 16 questions that can be answered using a slider. Each "answer" for a question can be between -3 and 3 (or, shifted, between 0 and 6).
A user should be able to find similar users based on their answers to the questions. I thought a vector database like Weaviate or Pinecone could help me find these matches on demand, but unfortunately, in simple experiments the similarity is mostly 0.
Here is what I am doing in Pinecone:
Indexing:
const index = await initIndex()
const vectors = [
  {
    id: '1',
    values: [-3, -3, -3, -3, -3]
  },
  {
    id: '2',
    values: [-1, -1, -1, -1, -1]
  },
  {
    id: '3',
    values: [0, 0, 0, 0, 0]
  },
  {
    id: '4',
    values: [1, 1, 1, 1, 1]
  },
  {
    id: '5',
    values: [3, 3, 3, 3, 3]
  }
] as Vector[]
const upsertRequest: UpsertRequest = {
  vectors
}
await index.upsert({
  upsertRequest,
})
Searching:
const index = await initIndex()
const queryRequest = {
  topK: 10,
  vector: [0, 0, 0, 0, 0],
  includeValues: true
}
const queryResponse = await index.query({ queryRequest })
Result:
{
  "queryResponse": {
    "results": [],
    "matches": [
      { "id": "2", "score": 0, "values": [-1, -1, -1, -1, -1] },
      { "id": "1", "score": 0, "values": [-3, -3, -3, -3, -3] },
      { "id": "3", "score": 0, "values": [0, 0, 0, 0, 0] },
      { "id": "5", "score": 0, "values": [3, 3, 3, 3, 3] },
      { "id": "4", "score": 0, "values": [1, 1, 1, 1, 1] }
    ],
    "namespace": ""
  }
}
Why is the score always 0? Shouldn't it be .5 based on the vectors in my database?
Upvotes: 1
Views: 1118
Reputation: 57798
It took a little bit of work, but I did manage to reproduce this. I created a COSINE-based index and added the data from your question. I then queried with a vector identical to the one stored under ID 3 (all zeros):
{'matches': [
{'id': '1', 'score': 0.0, 'values': [-3.0, -3.0, -3.0, -3.0, -3.0]},
{'id': '5', 'score': 0.0, 'values': [3.0, 3.0, 3.0, 3.0, 3.0]},
{'id': '3', 'score': 0.0, 'values': [0.0, 0.0, 0.0, 0.0, 0.0]},
{'id': '2', 'score': 0.0, 'values': [-1.0, -1.0, -1.0, -1.0, -1.0]},
{'id': '4', 'score': 0.0, 'values': [1.0, 1.0, 1.0, 1.0, 1.0]}],
'namespace': ''}
As a DataStax employee, I tried this on Astra DB next:
CREATE TABLE users (
user_id INT PRIMARY KEY,
survey_vector VECTOR<Float,5>);
CREATE CUSTOM INDEX users ON users(survey_vector) USING 'StorageAttachedIndex';
INSERT INTO users (user_id, survey_vector) VALUES (1,[-3, -3, -3, -3, -3]);
INSERT INTO users (user_id, survey_vector) VALUES (2,[-1, -1, -1, -1, -1]);
INSERT INTO users (user_id, survey_vector) VALUES (3,[0, 0, 0, 0, 0]);
INSERT INTO users (user_id, survey_vector) VALUES (4,[1, 1, 1, 1, 1]);
INSERT INTO users (user_id, survey_vector) VALUES (5,[3, 3, 3, 3, 3]);
It failed on the INSERT where id=3:
WriteFailure: Error from server: code=1500 [Replica(s) failed to execute write] message="Operation failed - received 0 responses and 3 failures: UNKNOWN from 10.16.22.38:7000, UNKNOWN from 10.16.12.4:7000, UNKNOWN from 10.16.8.4:7000" info={'consistency': 'LOCAL_QUORUM', 'required_responses': 2, 'received_responses': 0, 'failures': 3}
Astra DB threw a similar error when I tried an ANN query.
TL;DR;
You can't run a cosine-based vector search with a vector full of zeros (aka: null vector), because cosine similarity divides by the vector's magnitude, and for a null vector that is a divide-by-zero. Astra DB correctly threw an error (a consistency error, but an error nonetheless).
Pinecone hides it. I'm not sure whether it's failing silently, but it still gives you back results. It can't compute a meaningful score, though, which is why they're all zeros.
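To make the divide-by-zero concrete, here is a minimal sketch of plain cosine similarity. The `cosineSimilarity` helper is written for illustration only and is not the code either database runs; how each product handles the undefined result internally is not something this example can show.

```typescript
// Plain cosine similarity: dot(a, b) / (|a| * |b|).
// Illustrative helper only, not Pinecone's or Astra DB's implementation.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // With an all-zero vector the denominator is 0, so this evaluates 0 / 0.
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

const zeroQuery = [0, 0, 0, 0, 0];
const storedVector = [1, 1, 1, 1, 1];

// In JavaScript, 0 / 0 yields NaN instead of throwing.
console.log(cosineSimilarity(zeroQuery, storedVector)); // NaN

// A non-zero query gives a well-defined score (approximately 1 here,
// because both vectors point in the same direction).
console.log(cosineSimilarity([3, 3, 3, 3, 3], storedVector));
```

The second call also hints at a property worth keeping in mind for slider data: cosine similarity compares direction only, so `[1, 1, 1, 1, 1]` and `[3, 3, 3, 3, 3]` look identical to it.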
Anyway, a search on a null vector does work with a Euclidean index, because Euclidean distance is well defined for a null vector. Recreate your index as "EUCLIDEAN":
Pinecone with a Euclidean index:
{'matches': [
{'id': '3', 'score': 0.0, 'values': [0.0, 0.0, 0.0, 0.0, 0.0]},
{'id': '4', 'score': 5.0, 'values': [1.0, 1.0, 1.0, 1.0, 1.0]},
{'id': '2', 'score': 5.0, 'values': [-1.0, -1.0, -1.0, -1.0, -1.0]},
{'id': '1', 'score': 45.0,'values': [-3.0, -3.0, -3.0, -3.0, -3.0]},
{'id': '5', 'score': 45.0,'values': [3.0, 3.0, 3.0, 3.0, 3.0]}],
'namespace': ''}
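The Pinecone scores above (0, 5, 45) are consistent with the squared Euclidean distance from the query, where lower means closer. A small sketch that reproduces them, assuming that is how the score is derived (the helper name is hypothetical):

```typescript
// Squared Euclidean distance: sum of (a[i] - b[i])^2.
// Assumption: this is what the Euclidean score above represents.
function squaredEuclideanDistance(a: number[], b: number[]): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    const diff = a[i] - b[i];
    sum += diff * diff;
  }
  return sum;
}

const origin = [0, 0, 0, 0, 0];
const surveyVectors: { id: string; values: number[] }[] = [
  { id: '1', values: [-3, -3, -3, -3, -3] },
  { id: '2', values: [-1, -1, -1, -1, -1] },
  { id: '3', values: [0, 0, 0, 0, 0] },
  { id: '4', values: [1, 1, 1, 1, 1] },
  { id: '5', values: [3, 3, 3, 3, 3] },
];

// Prints 45, 5, 0, 5, 45 for ids 1 through 5.
for (const v of surveyVectors) {
  console.log(v.id, squaredEuclideanDistance(origin, v.values));
}
```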
Astra DB with a Euclidean index:
> CREATE CUSTOM INDEX users_survey_vector_idx ON
stackoverflow.users (survey_vector)
USING 'StorageAttachedIndex'
WITH OPTIONS = {'similarity_function': 'EUCLIDEAN'};
> SELECT user_id, similarity_euclidean(survey_vector,[0,0,0,0,0])
    AS similarity, survey_vector FROM users
    ORDER BY survey_vector
    ANN OF [0,0,0,0,0] LIMIT 5;
user_id | similarity | survey_vector
---------+------------+----------------------
3 | 1 | [0, 0, 0, 0, 0]
2 | 0.166667 | [-1, -1, -1, -1, -1]
4 | 0.166667 | [1, 1, 1, 1, 1]
5 | 0.021739 | [3, 3, 3, 3, 3]
1 | 0.021739 | [-3, -3, -3, -3, -3]
(5 rows)
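The Astra DB similarity values in this table are consistent with mapping the squared Euclidean distance d² into the range (0, 1] as 1 / (1 + d²), so that higher means closer. The sketch below assumes that mapping; the function name is made up for illustration.

```typescript
// Assumed mapping: similarity = 1 / (1 + squared Euclidean distance).
function euclideanSimilarity(a: number[], b: number[]): number {
  let squaredDistance = 0;
  for (let i = 0; i < a.length; i++) {
    const diff = a[i] - b[i];
    squaredDistance += diff * diff;
  }
  return 1 / (1 + squaredDistance);
}

const zeroVector = [0, 0, 0, 0, 0];

console.log(euclideanSimilarity(zeroVector, [0, 0, 0, 0, 0]).toFixed(6)); // 1.000000
console.log(euclideanSimilarity(zeroVector, [1, 1, 1, 1, 1]).toFixed(6)); // 0.166667
console.log(euclideanSimilarity(zeroVector, [3, 3, 3, 3, 3]).toFixed(6)); // 0.021739
```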
Edit for Dot Product
Why is the score always 0? Shouldn't it be .5 based on the vectors in my database?
I made this edit to cover my bases, in case you were originally using a Dot Product-based index. With Pinecone, I get the same result as with the Cosine-based index above: same order, all scores zero.
However, with Astra DB:
> CREATE CUSTOM INDEX users_survey_vector_idx ON
stackoverflow.users (survey_vector)
USING 'StorageAttachedIndex'
WITH OPTIONS = {'similarity_function': 'DOT_PRODUCT'};
> SELECT user_id, similarity_dot_product(survey_vector,[0,0,0,0,0])
    AS similarity, survey_vector FROM users
    ORDER BY survey_vector
    ANN OF [0,0,0,0,0] LIMIT 5;
user_id | similarity | survey_vector
---------+------------+----------------------
5 | 0.5 | [3, 3, 3, 3, 3]
1 | 0.5 | [-3, -3, -3, -3, -3]
2 | 0.5 | [-1, -1, -1, -1, -1]
4 | 0.5 | [1, 1, 1, 1, 1]
3 | 0.5 | [0, 0, 0, 0, 0]
(5 rows)
Now, I'm not sure why Pinecone isn't computing scores for Dot Product. But Astra DB seems to handle this one with scores that match what you were expecting.
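One possible reading of the difference, offered as an assumption and not as confirmed behavior of either product: the raw dot product of any vector with a null vector is 0, so an all-zero score is what an unnormalized dot product would return, while the 0.5 values are consistent with rescaling the dot product as (1 + dot) / 2. A sketch with hypothetical helper names:

```typescript
// Raw dot product: sum of a[i] * b[i].
function dotProduct(a: number[], b: number[]): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    sum += a[i] * b[i];
  }
  return sum;
}

// Assumed rescaling that would explain the 0.5 scores: (1 + dot) / 2.
function rescaledDotSimilarity(a: number[], b: number[]): number {
  return (1 + dotProduct(a, b)) / 2;
}

const nullQuery = [0, 0, 0, 0, 0];
const candidates = [
  [-3, -3, -3, -3, -3],
  [-1, -1, -1, -1, -1],
  [0, 0, 0, 0, 0],
  [1, 1, 1, 1, 1],
  [3, 3, 3, 3, 3],
];

// Every candidate gets a raw dot product of 0 and a rescaled value of 0.5.
for (const candidate of candidates) {
  console.log(dotProduct(nullQuery, candidate), rescaledDotSimilarity(nullQuery, candidate));
}
```

Either way, every stored vector ties against a null query, so a dot-product index cannot rank matches for it; only the Euclidean results distinguish near from far.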
Upvotes: 0