Bullzeye

Reputation: 163

Similarity search: number of requested results 3 is greater than number of elements in index 0

I've been facing an issue for some time now, and even though I've read the ChromaDB documentation and tested different approaches, I still haven't been able to resolve it.

I am getting the following error when I try to do a similarity search:

Number of requested results 3 is greater than number of elements in index 0,

Below is my script:

import os
import openai
import sys
import pypdf

# set up OpenAI API key
openai.api_key = os.environ["OPENAI_API_KEY"] # a restart was needed after the variable was set in the terminal

os.getcwd()

# Start with LangChain
# Import and use YouTube document loader

from langchain.document_loaders.generic import GenericLoader
from langchain.document_loaders.parsers import OpenAIWhisperParser
from langchain.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader

from langchain.document_loaders import PyPDFLoader

#start with one and then scale

url1="https://www.youtube.com/watch?v=wXj7Hzd8dOI" #Should you Change your job? J P Explains the risks of (not) quitting your job
url2="https://www.youtube.com/shorts/BnYK848GcAA" #How to handle emotional pain
url3="https://www.youtube.com/shorts/4qMyHwmnQHk"
save_dir="docs/youtube/"



loader = GenericLoader(
    YoutubeAudioLoader([url1],save_dir),
    OpenAIWhisperParser()
)
loader2 = GenericLoader(
    YoutubeAudioLoader([url2],save_dir),
    OpenAIWhisperParser()
)
loader3 = GenericLoader(
    YoutubeAudioLoader([url3],save_dir),
    OpenAIWhisperParser()
)

videos = []

videos.extend(loader.load())
videos.extend(loader2.load())
videos.extend(loader3.load())


print(len(videos))

#document splitting

from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 500,
    chunk_overlap = 30
)

splits = text_splitter.split_documents(videos)

print(len(splits))
print(splits)

#embeddings
from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings

embedding = OpenAIEmbeddings()

#persist_directory = 'chroma/'
#!rm -rf ./docs/chroma  # remove old database files if any


vectordb = Chroma.from_documents(
    documents=splits,
    embedding=embedding,
    persist_directory="docs/youtube/chroma/"
)

print(vectordb._collection.count())

# Similarity search. Initial checks

question = "What is the main topic of the text?"

sim1 = vectordb.similarity_search(question,k=3)

print(len(sim1))

Is the issue the embedding or the ChromaDB indexing?

Upvotes: 0

Views: 1865

Answers (1)

Bullzeye

Reputation: 163

The issue was due to a misunderstanding of this post: https://github.com/imartinez/privateGPT/issues/1012

Please do not comment out line 73 of your C:\Users\phyln\AppData\Local\Programs\Python\Python311\Lib\site-packages\chromadb\segment\impl\manager\local.py file, as suggested there.
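If you just need the query to stop erroring while you debug, a safer workaround than patching Chroma's source is to cap k at the collection size yourself. A minimal sketch (the helper name is mine, not part of the LangChain/Chroma API):

```python
def clamp_k(requested_k, collection_count):
    # Never ask Chroma for more results than the index actually holds;
    # the warning fires when k exceeds the number of indexed elements.
    return max(0, min(requested_k, collection_count))
```

Usage against the store from the question would look roughly like `k = clamp_k(3, vectordb._collection.count())`, followed by `vectordb.similarity_search(question, k=k)` only when k is non-zero. If the count is 0, the real problem is upstream: nothing was embedded into the collection.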

Instead, follow the suggestion to remove the conflict where a standalone hnswlib install shadows chroma-hnswlib.
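For reference, the cleanup looks roughly like this (a sketch based on the linked issue, not verified for every setup; adjust to your environment):

```shell
# Remove the standalone hnswlib that shadows the binding Chroma pins,
# then reinstall chroma-hnswlib and chromadb cleanly.
pip uninstall -y hnswlib chroma-hnswlib
pip install chroma-hnswlib
pip install --force-reinstall --no-deps chromadb
```

After this, only chroma-hnswlib should provide the `hnswlib` module, so Chroma loads the version it was built against.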

Upvotes: 0
