sup1214

Reputation: 1

How to Resolve Duplicate Vector Matches in Redis Vector Store Using Langchain Framework?

I'm currently using Redis Vector Store in conjunction with the Langchain framework. My application is configured to retrieve four distinct chunks, but I've noticed that sometimes all four chunks are identical. This is causing some inefficiencies and isn't the expected behavior. Does anyone know why this might be happening and have any recommendations on how to resolve it?

import os

from langchain.vectorstores.redis import Redis

# NOTE: `vectorstore` (the list of supported backends), `embedding()` and the
# module-level `index_name` used in getRelatedDocs are defined elsewhere in utils.py.
def getVectorStore(database: str, index_name: str = "KU_RULE_05") -> Redis:
    if database not in vectorstore:
        raise ValueError(f"{database} does not exist in vectorstore list in utils.py")

    if database == "Redis":
        VectorStore = Redis.from_existing_index(
            embedding=embedding(),
            redis_url=os.getenv("REDIS_URL"),
            index_name=index_name)

    return VectorStore


def getRelatedDocs(content: str, database="Redis"):
    VectorStore = getVectorStore(database=database, index_name=index_name)
    RelatedDocs = []

    for index, documents in enumerate(VectorStore.similarity_search(query=content)):
        RelatedDocs.append("{}: {}".format(index + 1, documents.page_content))
    return RelatedDocs

We've thoroughly checked the database for duplicate documents to see whether that could be the cause, but found none.
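For reference, a duplicate check along these lines can be scripted with redis-py; the sketch below assumes Langchain's default "doc:<index_name>:" key prefix and "content" field, so adjust it to your schema.

import hashlib

import redis

r = redis.Redis.from_url("redis://localhost:6379")

# Group stored chunks by a hash of their text content
seen = {}
for key in r.scan_iter(match="doc:KU_RULE_05:*"):
    content = r.hget(key, "content") or b""
    digest = hashlib.sha256(content).hexdigest()
    seen.setdefault(digest, []).append(key)

duplicates = {d: keys for d, keys in seen.items() if len(keys) > 1}
print(f"{len(duplicates)} duplicated contents found")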

Upvotes: 0

Views: 1434

Answers (1)

Spartee

Reputation: 21

OK, so most likely you are still using the from_documents method in your getVectorStore function when you should actually be using the from_existing_index method. You're likely re-generating and uploading the embeddings on every call, each time under a new UUID, hence the duplicates.
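To make that failure mode concrete, here is a hypothetical version of getVectorStore showing the pattern described above (the embedding class and URL are placeholders, not your actual code):

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores.redis import Redis

# Anti-pattern: from_documents re-embeds `docs` and writes them to Redis
# again on every call, each chunk under a freshly generated key/UUID,
# so every request adds another copy of the same content to the index.
def getVectorStore_bad(docs, index_name: str = "KU_RULE_05") -> Redis:
    return Redis.from_documents(
        docs,
        OpenAIEmbeddings(),
        redis_url="redis://localhost:6379",
        index_name=index_name,
    )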

The flow for reusing an index once it has been created (as from_documents does) is:

  1. from_existing_index (make sure to pass the schema if you're using metadata)
  2. then either use it as a retriever in a chain via as_retriever, or use the search methods directly, e.g. similarity_search (a retriever sketch appears further below).

Example:


from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores.redis import Redis

# any Embeddings implementation works here; OpenAIEmbeddings is just one option
embeddings = OpenAIEmbeddings()

metadata = [
    {
        "user": "john",
        "age": 18,
        "job": "engineer",
        "credit_score": "high",
    },
    {
        "user": "derrick",
        "age": 45,
        "job": "doctor",
        "credit_score": "low",
    }
]
texts = ["foo", "foo"]

rds = Redis.from_texts(
    texts,
    embeddings,
    metadatas=metadata,
    redis_url="redis://localhost:6379",
    index_name="users"
)
results = rds.similarity_search("foo")
print(results[0].page_content)

Then, to initialize an index that already exists, you can do:



new_rds = Redis.from_existing_index(
    embeddings,
    index_name="users",
    redis_url="redis://localhost:6379",
    schema="redis_schema.yaml"
)
results = new_rds.similarity_search("foo", k=3)
print(results[0].metadata)

Notice that I'm passing the schema above. If you're using metadata, you can write out the schema file using the write_schema method.

# write the schema to a yaml file
new_rds.write_schema("redis_schema.yaml")
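If you'd rather go the retriever route (step 2 in the flow above), a minimal sketch looks like this; the RetrievalQA chain and OpenAI LLM are only placeholders for whatever chain and model you actually use:

from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

# Wrap the existing vector store as a retriever returning the top 4 chunks
retriever = new_rds.as_retriever(search_kwargs={"k": 4})

# Hypothetical chain setup; swap in your own LLM and chain type
qa = RetrievalQA.from_chain_type(llm=OpenAI(), retriever=retriever)
print(qa.run("What do we know about john?"))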

I highly recommend going through the documentation for the newer release of the Redis integration as well:

https://python.langchain.com/docs/integrations/vectorstores/redis

OK, now given your code, I'm not positive what's actually causing this error, since you're sure you've curated your database contents, but could you try:


def getRelatedDocs(content: str, database="Redis"):
    VectorStore = getVectorStore(database=database, index_name=index_name)
    RelatedDocs = []

    docs = VectorStore.similarity_search(query=content)
    for i, document in enumerate(docs, start=1):
        RelatedDocs.append(f"{i}: {document.page_content}")
    return RelatedDocs

If this doesn't work, I would try running simpler examples with your codebase and see whether a more trivial case works.
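For example, a minimal end-to-end sanity check could look like the following (a local Redis instance and OpenAI embeddings are assumed here; swap in whatever you already have configured):

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores.redis import Redis

embeddings = OpenAIEmbeddings()

# Build a tiny index once with known, distinct texts
rds = Redis.from_texts(
    ["alpha", "beta", "gamma", "delta"],
    embeddings,
    redis_url="redis://localhost:6379",
    index_name="sanity_check",
)
rds.write_schema("sanity_schema.yaml")

# Re-open the same index without re-uploading anything
store = Redis.from_existing_index(
    embeddings,
    index_name="sanity_check",
    redis_url="redis://localhost:6379",
    schema="sanity_schema.yaml",
)

# All four results should be distinct; if duplicates show up here,
# the problem is in the indexing path rather than in the retrieval code.
results = store.similarity_search("alpha", k=4)
print([doc.page_content for doc in results])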

Upvotes: 0
