Reputation: 1
I'm currently using Redis Vector Store in conjunction with the Langchain framework. My application is configured to retrieve four distinct chunks, but I've noticed that sometimes all four chunks are identical. This is causing some inefficiencies and isn't the expected behavior. Does anyone know why this might be happening and have any recommendations on how to resolve it?
def getVectorStore(database: str, index_name: str = "KU_RULE_05") -> Redis:
    if database not in vectorstore:
        raise ValueError(f"{database} does not exist in vectorstore list in utils.py")
    if database == "Redis":
        VectorStore = Redis.from_existing_index(
            embedding=embedding(),
            redis_url=os.getenv("REDIS_URL"),
            index_name=index_name)
        return VectorStore

def getRelatedDocs(content: str, database="Redis"):
    VectorStore = getVectorStore(database=database, index_name=index_name)
    RelatedDocs = []
    for index, documents in enumerate(VectorStore.similarity_search(query=content)):
        RelatedDocs.append("{}: {}".format(index + 1, documents.page_content))
    return RelatedDocs
We've thoroughly checked the database for duplicate documents to see if that could be the cause of the issue, but we found none.
Upvotes: 0
Views: 1434
Reputation: 21
Ok so most likely you are continuing to use the from_documents method in the getVectorStore function when you should actually be using the from_existing_index method. You're likely re-generating and uploading the embeddings each time, each chunk under a unique UUID, hence the duplicates.
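To make that failure mode concrete, here is a minimal in-memory sketch (purely illustrative, not the actual LangChain or Redis internals) of why re-running the ingestion path on every startup duplicates entries: each call keys every chunk under a freshly generated UUID, so nothing is overwritten.

```python
import uuid

# Hypothetical in-memory store mimicking a vector store that keys
# each ingested chunk by a freshly generated UUID.
store = {}

def ingest(texts):
    # Each call generates brand-new keys, so re-running ingestion
    # stores the same text again instead of overwriting it.
    for text in texts:
        store[f"doc:{uuid.uuid4()}"] = text

ingest(["chunk A", "chunk B"])  # first ingest
ingest(["chunk A", "chunk B"])  # accidental re-ingest on restart

# "chunk A" now lives under two distinct keys
duplicate_keys = [k for k, v in store.items() if v == "chunk A"]
```

After the second call the store holds four entries instead of two, which is exactly the kind of silent duplication a `from_documents` call on every startup would produce.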
The flow for reusing an index once it has been created (as it is in from_documents) is:

1. from_existing_index (make sure to pass the schema if using metadata)
2. as_retriever, or use the search methods directly, i.e. similarity_search

Example:
from langchain.vectorstores.redis import Redis
metadata = [
    {
        "user": "john",
        "age": 18,
        "job": "engineer",
        "credit_score": "high",
    },
    {
        "user": "derrick",
        "age": 45,
        "job": "doctor",
        "credit_score": "low",
    },
]
texts = ["foo", "foo"]
rds = Redis.from_texts(
    texts,
    embeddings,
    metadatas=metadata,
    redis_url="redis://localhost:6379",
    index_name="users",
)
results = rds.similarity_search("foo")
print(results[0].page_content)
Then, to initialize an index that already exists, you can do:
new_rds = Redis.from_existing_index(
    embeddings,
    index_name="users",
    redis_url="redis://localhost:6379",
    schema="redis_schema.yaml",
)
results = new_rds.similarity_search("foo", k=3)
print(results[0].metadata)
Notice that I'm passing the schema above. If you're using metadata, you can write out the schema file using the write_schema method:
# write the schema to a yaml file
new_rds.write_schema("redis_schema.yaml")
I also highly recommend going through the documentation for the newer release of the Redis integration:
https://python.langchain.com/docs/integrations/vectorstores/redis
Now, given your code, I'm not certain what's actually causing this behavior, since you're positive you've curated your database contents, but could you try:
def getRelatedDocs(content: str, database="Redis", index_name: str = "KU_RULE_05"):
    VectorStore = getVectorStore(database=database, index_name=index_name)
    RelatedDocs = []
    docs = VectorStore.similarity_search(query=content)
    for i, document in enumerate(docs, start=1):
        RelatedDocs.append(f"{i}: {document.page_content}")
    return RelatedDocs
If this doesn't work, I would try running simpler examples against your codebase to see whether a more trivial case behaves correctly.
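As a stopgap while you track down the ingestion problem, you could also deduplicate the retrieved documents by page_content before formatting them. A sketch, using a hypothetical stand-in for LangChain's Document class so it runs standalone:

```python
from dataclasses import dataclass

@dataclass
class Document:
    # Minimal stand-in for langchain's Document type (illustrative only).
    page_content: str

def dedupe_docs(docs):
    # Keep only the first occurrence of each page_content, preserving order.
    seen = set()
    unique = []
    for doc in docs:
        if doc.page_content not in seen:
            seen.add(doc.page_content)
            unique.append(doc)
    return unique

results = dedupe_docs([Document("foo"), Document("foo"), Document("bar")])
```

This hides the symptom rather than fixing the duplicated index entries, so treat it as a temporary filter, not a solution.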
Upvotes: 0