Reputation: 1502
As the title states, I'm wondering if I can get more insight into choosing a metric for my Pinecone database index. Currently, they offer three distance metrics to choose from, which their documentation lists as euclidean, cosine, and dotproduct.
In my case, I have generated human-like descriptions of a fairly similar, repeating dataset as Markdown files. However, I'm wondering whether this just adds noise to my data, since the only things that change from file to file are (mainly) the numbers. Imagine these example documents:
Document 1:
# August 11th, 2023
Today we sold 5 apples and 3 oranges.
Document 2:
# August 12th, 2023
Today we sold 2 apples and 6 oranges.
Document 3:
# August 13th, 2023
Today we sold 0 apples and 1 orange.
and so on...
Then, you could imagine queries like "How many apples did we sell on August 12th, 2023?" I thought this would be "simple enough" for a custom embedding, but the results are far from correct most of the time! I am currently using the cosine metric for the index.
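For reference, a stripped-down sketch of the setup I'm describing looks roughly like this; the API keys, index name, environment, and client versions (pre-1.0 openai, 2.x pinecone-client) are illustrative placeholders rather than my exact configuration:

```python
import openai
import pinecone

# Placeholder credentials and names, for illustration only.
openai.api_key = "YOUR_OPENAI_KEY"
pinecone.init(api_key="YOUR_PINECONE_KEY", environment="us-west1-gcp")

# The index is created with the cosine metric; 1536 is the output
# dimension of OpenAI's text-embedding-ada-002 model.
if "daily-sales" not in pinecone.list_indexes():
    pinecone.create_index("daily-sales", dimension=1536, metric="cosine")
index = pinecone.Index("daily-sales")

def embed(text: str) -> list[float]:
    """Embed a document or query with OpenAI's ada-002 model."""
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=text)
    return resp["data"][0]["embedding"]

# Upsert one of the daily Markdown documents, then run a query against it.
doc = "# August 12th, 2023\nToday we sold 2 apples and 6 oranges."
index.upsert(vectors=[("2023-08-12", embed(doc), {"text": doc})])

result = index.query(
    vector=embed("How many apples did we sell on August 12th, 2023?"),
    top_k=3,
    include_metadata=True,
)
print(result)
```

As far as I can tell, the metric is fixed when the index is created, so switching metrics means rebuilding the index.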
I have a variety of questions that I haven't been able to find clear answers to:
First, for this type of data, which index distance metric makes the most sense?
Second, am I overcomplicating this problem, and should I just leave the dataset in a raw, structured format (e.g. JSON, as sketched after this list)?
Third, is it possible to create a sort of 'summary' file that I could weight more heavily than the 'daily' documents in my queries? Or is the whole point of RAG that I DON'T need to weight the documents separately, and I can just 'trust' the initial retrieval? Such a summary file would include a variety of statistics that would likely be queried for often. (In my example, perhaps the total YTD sales of apples and oranges, and the average sales of apples and oranges per day.)
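To make the second and third questions more concrete, here is a rough sketch (with invented field names) of what the raw structured records and a rolled-up 'summary' document could look like; the numbers come from the example documents above:

```python
# Invented field names; the values mirror the three example documents.
daily_records = [
    {"date": "2023-08-11", "apples_sold": 5, "oranges_sold": 3},
    {"date": "2023-08-12", "apples_sold": 2, "oranges_sold": 6},
    {"date": "2023-08-13", "apples_sold": 0, "oranges_sold": 1},
]

# A possible 'summary' document holding the statistics I'd expect to be
# queried for often (YTD totals and per-day averages).
summary = {
    "period": "2023 YTD",
    "total_apples": sum(r["apples_sold"] for r in daily_records),
    "total_oranges": sum(r["oranges_sold"] for r in daily_records),
    "avg_apples_per_day": sum(r["apples_sold"] for r in daily_records) / len(daily_records),
    "avg_oranges_per_day": sum(r["oranges_sold"] for r in daily_records) / len(daily_records),
}
```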
Upvotes: 1
Views: 1508
Reputation: 1465
You should use the same similarity metric used to train the model that created the embeddings.
For example, if you're using OpenAI embeddings (any of their embedding models released so far), you should use cosine similarity.
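For intuition, cosine similarity is just the dot product of the two vectors divided by the product of their lengths, so for unit-length vectors (OpenAI's embeddings are returned normalized) it ranks documents the same way a plain dot product would. A toy sketch with made-up vectors:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Dot product of the vectors divided by the product of their norms.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up 3-dimensional vectors standing in for real embeddings
# (OpenAI's text-embedding-ada-002 produces 1536-dimensional vectors).
query_vec = np.array([0.1, 0.3, 0.5])
doc_vec = np.array([0.2, 0.1, 0.4])
print(cosine_similarity(query_vec, doc_vec))
```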
Upvotes: 3