Abdullah Bilal

Reputation: 393

Simple chatbot returning results without spaces

I created an LLM chatbot that answers questions from a book on deep learning. Everything runs, but when I perform a similarity search, the results it returns have no spaces between the words.

nizedasacrucialtechnologythoughthefirstexperimentswithartificialneuralnetworkswereconductedinthe1950s.Deeplearninghasbeensuccessfullyusedincommercialapplicationssincethe1990s,butwasoftenregardedasbeingmoreofanartthanatechnologyandsomethingthatonlyanexpertcoulduse,untilrecently.Itistruethatsomeskillisrequiredtogetgoodperformancefromadeeplearningalgorithm.Fortunately,theamountofskillrequiredreducesastheamountoftrainingdataincreases.Thelearningalgorithmsreachinghumanperformanceoncomplextaskstodayarenearlyidenticaltothelearningalgorithmsthatstruggledtosolvetoyproblemsinthe1980s,thoughthemodelswetrainwiththesealgorithmshaveundergonechangesthatsimplifythetrainingofverydeeparchitectures.Themostimportantnewdevelopmentisthattodaywecanprovidethesealgorithmswiththeresourcestheyneedtosucceed.Figureshowshowthesizeofbenchmark1.8datasetshasincreasedremarkablyovertime.Thistrendisdrivenbytheincreasingdigitizationofsociety.Asmoreandmoreofouractivitiestakeplaceoncomputers,moreandmoreofwhatwedoisrecorded.Asourcomputersareincreasinglynetworkedtogether,itbecomeseasiertocentralizetheserecordsandcuratethem19

This is one example of the results it returns. If it helps, I'm using TokenTextSplitter for splitting the documents, although I've also tried RecursiveCharacterTextSplitter. I'm using text-embedding-3-large for generating the embeddings and Chroma as the vector store.
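As a sanity check on the splitter alone (using an arbitrary sample sentence and chunk size, not the actual book text), something like this sketch should show whether TokenTextSplitter itself strips whitespace:

from langchain.text_splitter import TokenTextSplitter

sample = "Deep learning has been successfully used in commercial applications."
splitter = TokenTextSplitter(chunk_size=16, chunk_overlap=0)

# If the printed chunks keep their spaces, the splitter is not the
# component dropping whitespace.
for chunk in splitter.split_text(sample):
    print(repr(chunk))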

Reproducible Example:

import os
from openai import OpenAI
import chromadb
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import TokenTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
import uuid

OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
OPENAI_ORGANISATION = os.getenv('OPENAI_ORGANISATION')
MODEL = os.getenv('MODEL')
CHROMA_HOST = os.getenv('CHROMA_HOST')

def connect_openai_api(organisation: str):
    client = OpenAI(
        organization=organisation
    )
    return client

def get_or_create_db(host: str, port: int):
    chroma_client = chromadb.HttpClient(host=host, port=port)
    collection = chroma_client.get_or_create_collection(name="deep_learning_chatbot")
    return collection, chroma_client

def pdf_loader(loaders: list):
    docs = []
    for loader in loaders:
        docs.extend(loader.load())
    return docs

def document_splitter(docs: list):
    text_splitter = TokenTextSplitter()
    splits = text_splitter.split_documents(docs)
    return splits

def embeddings_loader(splits: list, directory: str):
    embedding = OpenAIEmbeddings(model="text-embedding-3-large")
    vectordb = Chroma.from_documents(
        documents=splits,
        embedding=embedding,
        persist_directory=directory
    )
    return vectordb

def run():
    # connect_openai_api takes only the organisation
    openai_client = connect_openai_api(organisation=OPENAI_ORGANISATION)
    chroma_collection, chroma_client = get_or_create_db(host='localhost', port=8000)

    loaders = [
        PyPDFLoader("../docs/Deep Learning by Ian Goodfellow, Yoshua Bengio, Aaron Courville (z-lib.org).pdf"),
    ]

    docs = pdf_loader(loaders=loaders)
    splits = document_splitter(docs=docs)
    persist_directory = "vectorDB"
    vectordb = embeddings_loader(splits=splits, directory=persist_directory)

    question = "What is deep learning?"

    docs = vectordb.similarity_search(question,k=3)

    print(docs[0].page_content)
    print(docs[1].page_content)
    print(docs[2].page_content)
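To narrow down where the spaces disappear, a minimal check like the following (with a placeholder path and an arbitrary page index) would show whether the text already comes out of PyPDFLoader without spaces, before any splitting, embedding, or retrieval:

from langchain_community.document_loaders import PyPDFLoader

# Placeholder path; substitute the actual book PDF used above.
loader = PyPDFLoader("../docs/deep_learning.pdf")
pages = loader.load()

# If spaces are already missing here, the problem is in the PDF text
# extraction itself, not in the splitter, the embeddings, or Chroma.
print(pages[0].page_content[:500])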

Upvotes: 0

Views: 41

Answers (0)
