fullStackChris

Reputation: 1502

What's the best index distance metric for my Pinecone vector database, filled with a series of similarly formatted Markdown files?

As the title states, I'm wondering if I can get more insight into choosing a metric for my Pinecone database index. Currently, they offer 3 options to choose from. From their documentation, they are `euclidean`, `cosine`, and `dotproduct`.

In my case, I have generated human-like descriptions of a fairly similar repeating dataset in Markdown files; however, I'm wondering if this just adds noise to my data, since the only things changing in each file are (mainly) the numbers. Imagine these document examples:

Document 1:

# August 11th, 2023
Today we sold 5 apples and 3 oranges.

Document 2:

# August 12th, 2023
Today we sold 2 apples and 6 oranges.

Document 3:

# August 13th, 2023
Today we sold 0 apples and 1 orange.

and so on...

Then, you could imagine queries to be something like "how many apples did we sell on August 12th, 2023?" I thought this would be "simple enough" for a custom embedding, but the results are far from correct most of the time! I am currently using the cosine index.

I have a variety of questions that I haven't been able to find clear answers to:

First, for this type of data, which index distance metric makes the most sense?

Second, am I overcomplicating this problem? Should I just leave the dataset in a raw format (e.g. JSON)?

Third, is it possible to create a sort of 'summary' file that I could weight more heavily than the 'daily' documents in my queries? Or is the whole point of RAG that I DON'T need to weight the documents separately, and I can just 'trust' the initial retrieval? Such a summary file would include a variety of statistics that would likely be queried for often (in my example, perhaps the total YTD sales of apples and oranges, and the average daily sales of apples and oranges).

Upvotes: 1

Views: 1508

Answers (2)

Marcos

Reputation: 1465

You should use the same similarity metric that was used to train the model that created the embeddings.

For example, if you're using OpenAI embeddings (any of their embedding models so far), you should use cosine similarity.
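A minimal sketch of what that looks like in practice, assuming the pre-v3 `pinecone-client` and OpenAI's `text-embedding-ada-002` model (1536-dimensional vectors); the index name and credentials are placeholders:

```python
import pinecone

# Placeholder credentials; substitute your own.
pinecone.init(api_key="YOUR_API_KEY", environment="YOUR_ENVIRONMENT")

# text-embedding-ada-002 produces 1536-dimensional vectors, and OpenAI
# recommends cosine similarity for its embeddings, so the index is
# created with metric="cosine".
pinecone.create_index(
    name="daily-sales-notes",  # hypothetical index name
    dimension=1536,
    metric="cosine",
)
```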


Upvotes: 3

Aravind Ramachandran

Reputation: 25

  1. You have to experiment with different metrics to find the one that works best for your data.
  2. If your data has a logical separation, store it as metadata so that you can filter on it while querying. In your case, the date can go in the metadata (see the sketch after this list).
  3. A vector database returns the top-k results by similarity, so you might still end up with a wrong result at the top. Pass your question along with the top-k results (the relevant chunks) returned by the vector database to an LLM to come up with an accurate answer.
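To make points 2 and 3 concrete, here is a rough sketch, again assuming the pre-v3 `pinecone-client`; the index name, document ID, and embedding values are placeholders:

```python
import pinecone

pinecone.init(api_key="YOUR_API_KEY", environment="YOUR_ENVIRONMENT")
index = pinecone.Index("daily-sales-notes")  # hypothetical index name

# Placeholder vectors; in practice these come from your embedding model.
doc_embedding = [0.0] * 1536
query_embedding = [0.0] * 1536

# Upsert each daily document with its date (and raw text) as metadata.
index.upsert(vectors=[
    ("doc-2023-08-12", doc_embedding,
     {"date": "2023-08-12", "text": "Today we sold 2 apples and 6 oranges."}),
])

# At query time, filter on the date so only that day's documents
# compete for the top-k slots.
results = index.query(
    vector=query_embedding,
    top_k=3,
    filter={"date": {"$eq": "2023-08-12"}},
    include_metadata=True,
)

# Then hand the retrieved chunks plus the question to an LLM (point 3).
context = "\n".join(m["metadata"]["text"] for m in results["matches"])
prompt = (
    "Answer using only this context:\n" + context +
    "\n\nQuestion: how many apples did we sell on August 12th, 2023?"
)
```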

Hope I answered your question! Thanks.

Upvotes: 0
