Reputation: 77
I'm having trouble tracking token usage for Anthropic models. The tiktoken library natively supports OpenAI models, but I'm working with the Claude-3 model family from Anthropic.
When I use LlamaIndex for chat completion with Anthropic models, the token count is returned in the response. However, when I create a query engine, it doesn't return the token counts.
Is there any way to get token counts from my query engine?
Here's my code for reference:
from llama_index.llms.anthropic import Anthropic
from llama_index.core import Settings, VectorStoreIndex
from llama_index.core.retrievers import QueryFusionRetriever
from llama_index.core.response_synthesizers import CompactAndRefine
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.embeddings.openai import OpenAIEmbedding
import os
os.environ["ANTHROPIC_API_KEY"] = "sk-ant-api03-****"
tokenizer = Anthropic().tokenizer
Settings.tokenizer = tokenizer
llm = Anthropic(model="claude-3-opus-20240229")
resp = llm.complete("Paul Graham is ")
def generate_response(question, db_name, collection, usecase_id, llm, master_prompt):
    llm = Anthropic(model=llm, temperature=0.5)
    tokenizer = Anthropic().tokenizer
    Settings.tokenizer = tokenizer
    embed_model = OpenAIEmbedding(model="text-embedding-3-small")
    vector_store = get_vectordb(db_name, collection)
    Settings.llm = llm
    Settings.embed_model = embed_model
    print("llm and embed_model set")
    index = VectorStoreIndex.from_vector_store(vector_store=vector_store)
    vector_retriever = index.as_retriever(
        vector_store_query_mode="default",
        similarity_top_k=5,
    )
    text_retriever = index.as_retriever(
        vector_store_query_mode="sparse",
        similarity_top_k=5,
    )
    retriever = QueryFusionRetriever(
        [vector_retriever, text_retriever],
        similarity_top_k=5,
        num_queries=1,
        mode="relative_score",
        use_async=False,
    )
    response_synthesizer = CompactAndRefine()
    query_engine = RetrieverQueryEngine(
        retriever=retriever,
        response_synthesizer=response_synthesizer,
    )
    query_engine = index.as_query_engine()  # note: this overwrites the fusion query engine built above
    print("query_engine created")
    return query_engine.query(question)
Any help would be greatly appreciated! Thanks!
Upvotes: 0
Views: 1105
Reputation: 11543
I have not tried this myself, but according to the documentation you should use a TokenCountingHandler.
from llama_index.core.callbacks import CallbackManager, TokenCountingHandler
# Setup the tokenizer and token counter
token_counter = TokenCountingHandler(tokenizer=tokenizer)
# Configure the callback_manager
Settings.callback_manager = CallbackManager([token_counter])
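One detail to double-check: TokenCountingHandler expects its tokenizer argument to be a callable that turns a string into a list of tokens (the documented examples pass something like tiktoken's encode method). With the Anthropic tokenizer from your snippet, I would expect that to mean passing its encode method rather than the tokenizer object itself. A minimal sketch, assuming Anthropic().tokenizer exposes an encode method that returns a token list:
from llama_index.core import Settings
from llama_index.core.callbacks import CallbackManager, TokenCountingHandler
from llama_index.llms.anthropic import Anthropic

# Assumption: the Anthropic tokenizer exposes .encode(text) returning a token list.
anthropic_tokenizer = Anthropic().tokenizer
token_counter = TokenCountingHandler(tokenizer=anthropic_tokenizer.encode)
Settings.callback_manager = CallbackManager([token_counter])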
Then, after querying the engine, you should be able to access the token counts like this:
print(
    "Embedding Tokens: ",
    token_counter.total_embedding_token_count,
    "\n",
    "LLM Prompt Tokens: ",
    token_counter.prompt_llm_token_count,
    "\n",
    "LLM Completion Tokens: ",
    token_counter.completion_llm_token_count,
    "\n",
    "Total LLM Token Count: ",
    token_counter.total_llm_token_count,
    "\n",
)
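Applied to your generate_response function, the wiring would look roughly like the sketch below. I have not tested this with Anthropic models; it reuses the helpers from your question, registers the callback manager before the index and query engine are built, and wraps your function (generate_response_with_counts is just an illustrative name) so the counter is reset per call and the numbers reflect a single query:
from llama_index.core import Settings
from llama_index.core.callbacks import CallbackManager, TokenCountingHandler
from llama_index.llms.anthropic import Anthropic

# Register the counter globally before any index / query engine is constructed,
# so embedding and LLM calls are routed through this callback manager.
# Assumption: the Anthropic tokenizer exposes .encode(text) returning a token list.
token_counter = TokenCountingHandler(tokenizer=Anthropic().tokenizer.encode)
Settings.callback_manager = CallbackManager([token_counter])

def generate_response_with_counts(question, db_name, collection, usecase_id, llm, master_prompt):
    token_counter.reset_counts()  # count only this query
    response = generate_response(question, db_name, collection, usecase_id, llm, master_prompt)
    print("Embedding Tokens:", token_counter.total_embedding_token_count)
    print("LLM Prompt Tokens:", token_counter.prompt_llm_token_count)
    print("LLM Completion Tokens:", token_counter.completion_llm_token_count)
    print("Total LLM Token Count:", token_counter.total_llm_token_count)
    return response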
Upvotes: 0