Reputation: 6131
I am using LangChain to create embeddings and then ask a question to those embeddings like so:
from langchain.chains import ConversationalRetrievalChain
from langchain.chat_models import ChatOpenAI
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import DeepLake
from langchain.vectorstores.base import VectorStoreRetriever

embeddings: OpenAIEmbeddings = OpenAIEmbeddings(disallowed_special=())
db = DeepLake(
    dataset_path=deeplake_url,
    read_only=True,
    embedding_function=embeddings,
)
retriever: VectorStoreRetriever = db.as_retriever()
model = ChatOpenAI(model_name="gpt-3.5-turbo")
qa = ConversationalRetrievalChain.from_llm(model, retriever=retriever)
result = qa({"question": question, "chat_history": chat_history})
But I am getting the following error:
File "/xxxxx/openai/api_requestor.py", line 763, in _interpret_response_line
raise self.handle_error_response(
openai.error.InvalidRequestError: This model's maximum context length is 4097 tokens. However, your messages resulted in 13918 tokens. Please reduce the length of the messages.
The chat_history is empty and the question is quite small.
How can I reduce the size of tokens being passed to OpenAI?
I'm assuming the documents returned by the embeddings search are too large when passed to OpenAI. It might be easy enough to just truncate the data being sent to OpenAI.
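For reference, here is a rough way to check that assumption by counting the tokens the retrieved documents contribute (tiktoken and this snippet are just a sketch, not part of the chain above):

import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
docs = retriever.get_relevant_documents(question)
doc_tokens = sum(len(enc.encode(d.page_content)) for d in docs)
print(f"retrieved docs: {len(docs)}, total tokens: {doc_tokens}")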
Upvotes: 3
Views: 20513
Reputation: 21
My two cents: the existing explanation is not strictly true, it might just be working by accident.
max_tokens_limit applies specifically to the new tokens created by the model.
However, a model's token limit is the sum of all tokens in the input plus all tokens it generates. So if you reduce the new tokens by enough, you can stay under the overall bar set by the model. However, it is still possible to exceed the model's limit even if you set max_tokens_limit = 0, if your input tokens are too great.
Your input tokens + max_tokens_limit <= model token limit.
Everyone will have a different approach, depending on what they prefer to prioritize. For example, if I'm using a 512-token model, I might aim for a maximum output of around 200 tokens, so I would clip the input length to 312 tokens.
This will completely depend on your model, task type, and use case.
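As a rough sketch of that budget rule (tiktoken, the encoding, and the numbers here are illustrative, not from the question's code):

import tiktoken

MODEL_LIMIT = 512        # total context window of the (hypothetical) model
MAX_NEW_TOKENS = 200     # tokens reserved for the generated answer
INPUT_BUDGET = MODEL_LIMIT - MAX_NEW_TOKENS   # 312 tokens left for the input

enc = tiktoken.get_encoding("cl100k_base")

def clip_input(text: str, budget: int = INPUT_BUDGET) -> str:
    # Keep only as many input tokens as the budget allows.
    tokens = enc.encode(text)
    return enc.decode(tokens[:budget])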
EDIT: if you use the tokenizer directly, which doesn't seem to be the case here, you can add a max_length limit to the tokenised input_ids. But I don't think this happens with LangChain - it's handled by the pipeline/chain.
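For completeness, this is the kind of direct-tokenizer truncation I mean (a sketch assuming a Hugging Face tokenizer, which is not part of the setup in the question):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
# max_length caps the tokenised input_ids; anything longer is truncated.
encoded = tokenizer("some very long input text ...", max_length=312, truncation=True)
print(len(encoded["input_ids"]))  # <= 312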
Upvotes: 1
Reputation: 6131
When you instantiate the ConversationalRetrievalChain object, pass in a max_tokens_limit amount.
qa = ConversationalRetrievalChain.from_llm(
model, retriever=retriever, max_tokens_limit=4000
)
This will automatically truncate the tokens when querying OpenAI / your LLM.
In the base.py of ConversationalRetrievalChain there is a function that is called when your question is sent to DeepLake / OpenAI:
def _get_docs(self, question: str, inputs: Dict[str, Any]) -> List[Document]:
    docs = self.retriever.get_relevant_documents(question)
    return self._reduce_tokens_below_limit(docs)
This reads from the DeepLake vector database and adds the retrieved documents as context to the text that gets sent to OpenAI.
The _reduce_tokens_below_limit method reads the class instance variable max_tokens_limit and trims the retrieved docs so their combined size stays below that limit.
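For intuition, here is a rough sketch of what that reduction step does (my own illustration, not the library's actual code; count_tokens stands in for whatever token counter the chain uses):

from typing import Callable, List

def reduce_tokens_below_limit(
    docs: List[str],
    count_tokens: Callable[[str], int],
    max_tokens_limit: int,
) -> List[str]:
    # Drop documents from the end of the retrieved list until the
    # combined token count fits under max_tokens_limit.
    token_counts = [count_tokens(d) for d in docs]
    total = sum(token_counts)
    num_docs = len(docs)
    while num_docs > 0 and total > max_tokens_limit:
        num_docs -= 1
        total -= token_counts[num_docs]
    return docs[:num_docs]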
Upvotes: 5