Reputation: 1
I have a large database of documents (these “documents” are essentially web pages, all in HTML). They contain information about the business itself, and a lot of that information is similar across documents. What I want to do is create a chatbot on top of this database that can answer any question regarding the content of its documents.
Now, if I pass the correct information to GPT, it can answer the question easily. The main difficulty is getting that relevant information in the first place, so that it can be provided to GPT. Right now, I'm chunking the documents, embedding those chunks, storing them in a vector database and, when a question is asked, fetching the k nearest embeddings from that vector database (I'm using text-embedding-ada-002 to create the embeddings, by the way).
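Here is a simplified sketch of what I'm doing (a plain numpy dot-product search stands in for the real vector database, and the chunking step is just a placeholder):

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return np.array([d.embedding for d in resp.data])

chunks = ["...chunk 1...", "...chunk 2..."]  # output of my chunking step
chunk_vectors = embed(chunks)                # indexed once, up front

def top_k(question, k=4):
    q = embed([question])[0]
    # ada-002 vectors are unit length, so a dot product is cosine similarity
    scores = chunk_vectors @ q
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]
```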
So, my questions are:
(I've also asked the same questions in this OpenAI Community Forum post.)
Upvotes: 0
Views: 1338
Reputation: 1
Yours seems to be a chunking problem. Your content is HTML, so it is semi-structured. It's really important to chunk text following the original layout and content semantics.
If you use a modern text splitter, you will probably solve this problem. Maybe you are asking why...
The size of the chunk normally doesn't matter much; the key point is to split documents into pieces of text that are semantically coherent. As humans, we tend to do that naturally when reading a visual document (HTML, for example), and for this reason the layout information is important.
Optimal chunks will improve your RAG performance (both retrieval and LLM output) and produce better embeddings, which mitigates the issue of short content getting the highest similarity scores.
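For example, here is a rough sketch of layout-aware chunking with BeautifulSoup: it splits each page at its headings so every chunk stays one semantic section (the tag list and the helper name are just illustrative, not a complete solution):

```python
from bs4 import BeautifulSoup

def split_by_headings(html):
    soup = BeautifulSoup(html, "html.parser")
    root = soup.body or soup          # some pages may lack a <body>
    chunks, current = [], []
    for el in root.find_all(["h1", "h2", "h3", "p", "li"]):
        if el.name in ("h1", "h2", "h3") and current:
            chunks.append("\n".join(current))   # close the previous section
            current = []
        current.append(el.get_text(" ", strip=True))
    if current:
        chunks.append("\n".join(current))
    return chunks
```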
I don't like to promote myself, but we are releasing that solution at https://preprocess.co so you don't have to build it yourself.
Upvotes: 0
Reputation: 663
Since you already know the technical side of building a Retrieval Augmented Generation (RAG) system, I'm going to share some experience I've gained.
RAG works best if your data is as clean as possible. This sucks, as it's a lot of work. If you have a lot of HTML tags, they will add noise to your embeddings. Also, if your documents contain a lot of similar data, your retriever will have a hard time, as everything looks similar.
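As a starting point, a minimal cleaning step with BeautifulSoup might look like this (the tags to drop are just an example; what counts as boilerplate depends on your pages):

```python
import re
from bs4 import BeautifulSoup

def html_to_text(html):
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.decompose()                       # drop non-content elements
    text = soup.get_text(" ", strip=True)     # visible text only
    return re.sub(r"\s+", " ", text)          # collapse leftover whitespace
```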
There is a paper stating that RAG combined with an LLM with a large context window works best, so you can send many more chunks and let the LLM do the rest (see the sketch after the links below). GPT-4 Turbo has a context window of 128k tokens; compared to 4k for GPT-3.5, that is a lot more. There are also open-source models with 200k tokens.
Paper: https://arxiv.org/abs/2310.03025
GPT-4 Turbo: https://help.openai.com/en/articles/8555510-gpt-4-turbo
Open-source model with a 200k context window: https://huggingface.co/01-ai/Yi-34B
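To illustrate the "send more chunks" idea, here is a hedged sketch with the OpenAI client (the `top_k` retriever from your setup is assumed, and the prompt wording is just an example):

```python
from openai import OpenAI

client = OpenAI()

def answer(question, chunks):
    context = "\n\n---\n\n".join(chunks)   # with 128k tokens, k can be large
    resp = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {"role": "system",
             "content": "Answer only from the provided context:\n\n" + context},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content
```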
Upvotes: 0