Reputation: 393
I created an LLM chatbot that answers questions from a book on deep learning. I've managed to get everything running, but whenever I run a similarity search, the result it returns has no spaces.
nizedasacrucialtechnologythoughthefirstexperimentswithartificialneuralnetworkswereconductedinthe1950s.Deeplearninghasbeensuccessfullyusedincommercialapplicationssincethe1990s,butwasoftenregardedasbeingmoreofanartthanatechnologyandsomethingthatonlyanexpertcoulduse,untilrecently.Itistruethatsomeskillisrequiredtogetgoodperformancefromadeeplearningalgorithm.Fortunately,theamountofskillrequiredreducesastheamountoftrainingdataincreases.Thelearningalgorithmsreachinghumanperformanceoncomplextaskstodayarenearlyidenticaltothelearningalgorithmsthatstruggledtosolvetoyproblemsinthe1980s,thoughthemodelswetrainwiththesealgorithmshaveundergonechangesthatsimplifythetrainingofverydeeparchitectures.Themostimportantnewdevelopmentisthattodaywecanprovidethesealgorithmswiththeresourcestheyneedtosucceed.Figureshowshowthesizeofbenchmark1.8datasetshasincreasedremarkablyovertime.Thistrendisdrivenbytheincreasingdigitizationofsociety.Asmoreandmoreofouractivitiestakeplaceoncomputers,moreandmoreofwhatwedoisrecorded.Asourcomputersareincreasinglynetworkedtogether,itbecomeseasiertocentralizetheserecordsandcuratethem19
This is one example of the answers it returns. If it helps, I'm using TokenTextSplitter for splitting the documents, although I've also tried RecursiveCharacterTextSplitter. I'm using text-embedding-3-large for generating the text embeddings and Chroma as the vector store.
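To narrow down where the spaces disappear, here is a minimal check of the raw loader output (same PDF path as the script below); if the text already has no spaces at this point, the problem is upstream of the splitter, the embeddings and the vector store:

from langchain_community.document_loaders import PyPDFLoader

# Inspect the raw extracted text of a single page, before any splitting
# or embedding happens. The page index is arbitrary.
loader = PyPDFLoader("../docs/Deep Learning by Ian Goodfellow, Yoshua Bengio, Aaron Courville (z-lib.org).pdf")
pages = loader.load()
print(pages[0].page_content[:500])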
Reproducible Example:
import os
import uuid

import chromadb
from openai import OpenAI
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import TokenTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
OPENAI_ORGANISATION = os.getenv('OPENAI_ORGANISATION')
MODEL = os.getenv('MODEL')
CHROMA_HOST = os.getenv('CHROMA_HOST')


def connect_openai_api(organisation: str):
    # OpenAI client for the chat model; the embedding calls below go through langchain_openai
    client = OpenAI(organization=organisation)
    return client


def get_or_create_db(host: str, port: int):
    # Connect to the Chroma server and get (or create) the chatbot collection
    chroma_client = chromadb.HttpClient(host=host, port=port)
    collection = chroma_client.get_or_create_collection(name="deep_learning_chatbot")
    return collection, chroma_client


def pdf_loader(loaders: list):
    docs = []
    for loader in loaders:
        docs.extend(loader.load())
    return docs


def document_splitter(docs: list):
    text_splitter = TokenTextSplitter()
    splits = text_splitter.split_documents(docs)
    return splits


def embeddings_loader(splits: list, directory: str):
    embedding = OpenAIEmbeddings(model="text-embedding-3-large")
    vectordb = Chroma.from_documents(
        documents=splits,
        embedding=embedding,
        persist_directory=directory,
    )
    return vectordb


def run():
    openai_client = connect_openai_api(organisation=OPENAI_ORGANISATION)
    chroma_collection, chroma_client = get_or_create_db(host='localhost', port=8000)
    loaders = [
        PyPDFLoader("../docs/Deep Learning by Ian Goodfellow, Yoshua Bengio, Aaron Courville (z-lib.org).pdf"),
    ]
    docs = pdf_loader(loaders=loaders)
    splits = document_splitter(docs=docs)
    persist_directory = "vectorDB"
    vectordb = embeddings_loader(splits=splits, directory=persist_directory)
    question = "What is deep learning?"
    docs = vectordb.similarity_search(question, k=3)
    print(docs[0].page_content)
    print(docs[1].page_content)
    print(docs[2].page_content)


if __name__ == "__main__":
    run()
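In case it is useful, the splitter can also be checked in isolation on a plain string (a standalone sketch, not part of the app above) to see whether TokenTextSplitter is the stage dropping the spaces:

from langchain.text_splitter import TokenTextSplitter

# Split a plain sentence and print the chunks; if the spaces survive here,
# the splitter is not the stage removing them.
splitter = TokenTextSplitter(chunk_size=20, chunk_overlap=0)
sample = "Deep learning has been successfully used in commercial applications since the 1990s."
print(splitter.split_text(sample))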
Upvotes: 0
Views: 41