Mohil

Reputation: 77

How to Track Token Usage with TikToken Library for Anthropic Models in llama-index Query Engine?

I'm facing an issue tracking token usage for Anthropic models with the tiktoken library. tiktoken natively supports only OpenAI models, but I'm working with the Claude 3 model family from Anthropic.

When I use llama-index for chat completion with Anthropic models, the response includes the token counts. However, when I create a query engine, the response doesn't include them.

Is there any way to get token counts in my query engine?

Here's my code for reference:

Chat Completion:

from llama_index.llms.anthropic import Anthropic
from llama_index.core import Settings
import os

os.environ["ANTHROPIC_API_KEY"] = "sk-ant-api03-****"

# Use Claude's own tokenizer rather than a tiktoken one
tokenizer = Anthropic().tokenizer
Settings.tokenizer = tokenizer

llm = Anthropic(model="claude-3-opus-20240229")
resp = llm.complete("Paul Graham is ")
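
For reference, this is how I read the counts off the chat completion response (a minimal sketch: it assumes the raw Anthropic payload, with its usage block, is exposed as resp.raw, which may be a dict or a Message object depending on the llama-index version):

# Assumption: resp.raw carries Anthropic's usage report
raw = resp.raw
usage = raw["usage"] if isinstance(raw, dict) else raw.usage
print(usage)  # input_tokens / output_tokens as counted by the API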

Query Engine:

from llama_index.core import Settings, VectorStoreIndex
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.response_synthesizers import CompactAndRefine
from llama_index.core.retrievers import QueryFusionRetriever
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.anthropic import Anthropic

def generate_response(question, db_name, collection, usecase_id, llm, master_prompt):
    llm = Anthropic(model=llm, temperature=0.5)
    # Claude's tokenizer, since tiktoken does not cover Anthropic models
    Settings.tokenizer = Anthropic().tokenizer
    embed_model = OpenAIEmbedding(model="text-embedding-3-small")
    vector_store = get_vectordb(db_name, collection)  # my own helper
    Settings.llm = llm
    Settings.embed_model = embed_model
    print("llm and embed_model set")

    index = VectorStoreIndex.from_vector_store(vector_store=vector_store)

    # Dense retrieval plus sparse retrieval, fused by relative score
    vector_retriever = index.as_retriever(
        vector_store_query_mode="default",
        similarity_top_k=5,
    )
    text_retriever = index.as_retriever(
        vector_store_query_mode="sparse",
        similarity_top_k=5,
    )
    retriever = QueryFusionRetriever(
        [vector_retriever, text_retriever],
        similarity_top_k=5,
        num_queries=1,
        mode="relative_score",
        use_async=False,
    )

    response_synthesizer = CompactAndRefine()
    query_engine = RetrieverQueryEngine(
        retriever=retriever,
        response_synthesizer=response_synthesizer,
    )
    print("query_engine created")

    return query_engine.query(question)

Any help would be greatly appreciated! Thanks!


Upvotes: 0

Views: 1105

Answers (1)

pasine

Reputation: 11543

I have not tried this myself, but according to the documentation you should use TokenCountingHandler.

from llama_index.core import Settings
from llama_index.core.callbacks import CallbackManager, TokenCountingHandler
from llama_index.llms.anthropic import Anthropic

# Set up the tokenizer and token counter
tokenizer = Anthropic().tokenizer
token_counter = TokenCountingHandler(tokenizer=tokenizer)

# Configure the callback_manager
Settings.callback_manager = CallbackManager([token_counter])
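
One thing to watch (again, untested on my side): Settings.callback_manager applies globally, so the counter should be registered before the index and query engine are created, otherwise their embedding and LLM calls may not be instrumented. A sketch of how this would slot into your generate_response, reusing the vector_store and question from your snippet:

# Register the counter first, with Claude's tokenizer
token_counter = TokenCountingHandler(tokenizer=Anthropic().tokenizer)
Settings.callback_manager = CallbackManager([token_counter])

# Only then build the index and query engine so their calls are counted
index = VectorStoreIndex.from_vector_store(vector_store=vector_store)
query_engine = index.as_query_engine()
response = query_engine.query(question)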

Then, after querying the engine, you should be able to access the token counts like this:

print(
    "Embedding Tokens: ",
    token_counter.total_embedding_token_count,
    "\n",
    "LLM Prompt Tokens: ",
    token_counter.prompt_llm_token_count,
    "\n",
    "LLM Completion Tokens: ",
    token_counter.completion_llm_token_count,
    "\n",
    "Total LLM Token Count: ",
    token_counter.total_llm_token_count,
    "\n",
)
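
Note that the counts accumulate across calls, so if you want per-query numbers you can reset the handler between queries:

# Zero the counters before the next query to get per-query numbers
token_counter.reset_counts()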

Upvotes: 0
