Reputation: 163
I have been stuck on an issue for some time. I have read the ChromaDB documentation and tried several approaches, but I still cannot resolve it.
I am getting the following error when I try to do a similarity search:
Number of requested results 3 is greater than number of elements in index 0,
Below is my script:
import os
import openai
import sys
import pypdf
#set-up OPEN_AI API key
openai.api_key = os.environ["OPENAI_API_KEY"] #a restart was needed after the variable was set through the terminal
os.getcwd()
# Start with LangChain
# Import and use YouTube document loader
from langchain.document_loaders.generic import GenericLoader
from langchain.document_loaders.parsers import OpenAIWhisperParser
from langchain.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader
from langchain.document_loaders import PyPDFLoader
#start with one and then scale
url1="https://www.youtube.com/watch?v=wXj7Hzd8dOI" #Should you Change your job? J P Explains the risks of (not) quitting your job
url2="https://www.youtube.com/shorts/BnYK848GcAA" #How to handle emotional pain
url3="https://www.youtube.com/watch?v=wXj7Hzd8dOI" #https://www.youtube.com/shorts/4qMyHwmnQHk
save_dir="docs/youtube/"
loader = GenericLoader(
    YoutubeAudioLoader([url1], save_dir),
    OpenAIWhisperParser()
)
loader2 = GenericLoader(
    YoutubeAudioLoader([url2], save_dir),
    OpenAIWhisperParser()
)
loader3 = GenericLoader(
    YoutubeAudioLoader([url3], save_dir),
    OpenAIWhisperParser()
)
videos = []
videos.extend(loader.load())
videos.extend(loader2.load())
videos.extend(loader3.load())
print(len(videos))
#document splitting
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=30
)
splits = text_splitter.split_documents(videos)
print(len(splits))
print(splits)
#embeddings
from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings
embedding = OpenAIEmbeddings()
#persist_directory = 'chroma/'
#!rm -rf ./docs/chroma # remove old database files if any
vectordb = Chroma.from_documents(
    documents=splits,
    embedding=embedding,
    persist_directory="docs/youtube/chroma/"
)
print(vectordb._collection.count())
#Similarity search. Initial checks
question = "What is the main topic of the text?"
sim1 = vectordb.similarity_search(question,k=3)
print(len(sim1))
Is the issue with the embedding or with the ChromaDB indexing?
Upvotes: 0
Views: 1865
Reputation: 163
The issue was due to a misunderstanding of this post: https://github.com/imartinez/privateGPT/issues/1012
Please do not comment out line 73 of your C:\Users\phyln\AppData\Local\Programs\Python\Python311\Lib\site-packages\chromadb\segment\impl\manager\local.py file, as suggested there.
Instead, follow the suggestion to remove the conflict where hnswlib shadows chroma-hnswlib.
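To check whether that conflict is present in your environment, something like the sketch below may help. It only reports which of the two distributions pip knows about; it does not change anything. The function names `installed_version` and `find_hnswlib_conflict` are my own, and the assumption that both distributions provide the same top-level `hnswlib` module is not stated in the linked issue.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def find_hnswlib_conflict(lookup=installed_version):
    """Report whether both hnswlib and chroma-hnswlib are installed.

    Assumption: both distributions ship a top-level module named `hnswlib`,
    so having both installed lets one shadow the other.
    `lookup` is injectable only so the check can be exercised without
    touching the real environment.
    """
    versions = {name: lookup(name) for name in ("hnswlib", "chroma-hnswlib")}
    conflict = all(version is not None for version in versions.values())
    return conflict, versions


if __name__ == "__main__":
    conflict, versions = find_hnswlib_conflict()
    print(versions)
    if conflict:
        print("Both hnswlib and chroma-hnswlib are installed; they may conflict.")
```

If both show up, one possible cleanup (an assumption on my part, not something confirmed above) is to uninstall both with pip and then reinstall only chroma-hnswlib, after which the collection should be rebuilt so that `vectordb._collection.count()` is no longer 0.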
Upvotes: 0