Reputation: 391
import os

from langchain.document_loaders import UnstructuredFileLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
os.environ["OPENAI_API_KEY"] = "KEY"
loader = UnstructuredFileLoader(
'path_to_file'
)
docs = loader.load()
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=1000, chunk_overlap=200, add_start_index=True
)
all_splits = text_splitter.split_documents(docs)
vectorstore = Chroma.from_documents(documents=all_splits, embedding=OpenAIEmbeddings())
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 6})
retrieved_docs = retriever.get_relevant_documents(
"What is X?"
)
This returns:
[Document(page_content="...", metadata={'source': 'path_to_text', 'start_index': 16932}),
Document(page_content="...", metadata={'source': 'path_to_text', 'start_index': 16932}),
Document(page_content="...", metadata={'source': 'path_to_text', 'start_index': 16932}),
Document(page_content="...", metadata={'source': 'path_to_text', 'start_index': 16932}),
Document(page_content="...", metadata={'source': 'path_to_text', 'start_index': 16932}),
Document(page_content="...", metadata={'source': 'path_to_text', 'start_index': 16932})]
All six results appear to be the same document.
When I first ran this code in Google Colab/Jupyter Notebook, it returned different documents. As I ran it more, it started returning the same documents. That makes me think this is a database issue, where the same entry is inserted into the database on each run.
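One way to test that hypothesis (a quick check, reusing the vectorstore and all_splits defined above): compare the number of stored entries with the number of splits produced by a single run.
# If the store holds a multiple of len(all_splits) entries, every run
# has been re-inserting the same chunks.
print(len(vectorstore.get()["ids"]), "stored entries vs", len(all_splits), "splits")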
How do I return 6 different unique documents?
Upvotes: 7
Views: 4510
Reputation: 746
I wrote this simple function to find the unique source files among the embedded docs in a Chroma vector store. It iterates through all the (possibly duplicated) source files and prints the unique values:
## get list of all unique source files in the vector db
def get_unique_files():
    db = vectordb  # the Chroma vector store built earlier
    data = db.get()  # fetch ids/metadatas once instead of on every loop iteration
    print("\nEmbedding keys:", data.keys())
    print("\nNumber of embedded docs:", len(data["ids"]))
    # Collect the source file of every embedded chunk.
    file_list = [doc["source"] for doc in data["metadatas"]]
    ### Set only stores a value once even if it is inserted more than once.
    unique_list = list(set(file_list))
    print("\nList of unique files in db:\n")
    for unique_file in unique_list:
        print(unique_file)
Call the function with:
get_unique_files()
This will output only the individual source files that were embedded:
Embedding keys: dict_keys(['ids', 'embeddings', 'metadatas', 'documents', 'uris', 'data'])
Number of embedded docs: 140
List of unique files in db:
pdf-files/leadership-team.pdf
pdf-files/report-summary.pdf
csv-files/small-csv.csv
ppt-content/presentation.pptx
csv-files/dataset-04-17-2024.csv
Upvotes: 0
Reputation: 49571
The issue is here:
Chroma.from_documents(documents=all_splits, embedding=OpenAIEmbeddings())
Every time you execute the file, you insert the same documents into the database again.
You could comment out that part of the code if you are inserting from the same file on every run (a sketch of automating that follows), or you could drop the near-duplicate vectors using EmbeddingsRedundantFilter.
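A minimal sketch of the first option, assuming a local persist directory (the "chroma_db" path is my choice, not from the question): build the store once and reopen it on later runs instead of re-inserting.
import os

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

PERSIST_DIR = "chroma_db"  # assumed location for the persisted store

embeddings = OpenAIEmbeddings()
if os.path.isdir(PERSIST_DIR):
    # The store already exists: open it instead of embedding the splits again.
    vectorstore = Chroma(persist_directory=PERSIST_DIR, embedding_function=embeddings)
else:
    # First run: embed the splits once and persist them to disk.
    vectorstore = Chroma.from_documents(
        documents=all_splits, embedding=embeddings, persist_directory=PERSIST_DIR
    )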
The LangChain docs describe EmbeddingsRedundantFilter as a "Filter that drops redundant documents by comparing their embeddings."
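A sketch of that second option, wiring the filter in front of the retriever from the question via DocumentCompressorPipeline and ContextualCompressionRetriever:
from langchain.document_transformers import EmbeddingsRedundantFilter
from langchain.embeddings import OpenAIEmbeddings
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import DocumentCompressorPipeline

# Drop retrieved documents whose embeddings are nearly identical to an earlier hit.
redundant_filter = EmbeddingsRedundantFilter(embeddings=OpenAIEmbeddings())
pipeline = DocumentCompressorPipeline(transformers=[redundant_filter])

compression_retriever = ContextualCompressionRetriever(
    base_compressor=pipeline, base_retriever=retriever
)
retrieved_docs = compression_retriever.get_relevant_documents("What is X?")
Note that this only deduplicates what comes back from a query; if the underlying store keeps growing with copies, clearing it (or persisting it once, as above) is still the cleaner fix.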
Upvotes: 11