Reputation: 5150
For the Node.js code below, the response reports prompt_tokens = 24. I want to be able to determine the expected prompt_tokens before making the request.
import { Configuration, OpenAIApi } from 'openai';
const configuration = new Configuration({
  apiKey: process.env.OPENAI_API_KEY,
});
const openai = new OpenAIApi(configuration);
const completion = await openai.createChatCompletion({
  model: "gpt-4",
  messages: [
    {role: "system", content: systemPrompt}, // systemPrompt = 'You are a useful assistant.'
    {role: "user", content: userPrompt}      // userPrompt = 'What is the meaning of life?'
  ]
});
/* completion.data = {
  id: 'chatcmpl-72Andnl250jsvSJGbjBJ6YzzFGToA',
  object: 'chat.completion',
  created: 1680752525,
  model: 'gpt-4-0314',
  usage: { prompt_tokens: 24, completion_tokens: 91, total_tokens: 115 },
  choices: [ [Object] ]
} */
It seems each model has its own encoding, and the best library for working with it is Python's tiktoken. So to estimate prompt_tokens, I would pass a "text" value to the script below. However, I am not sure what to use as "text" for the "messages" in the Node.js code above so that print(token_count) outputs 24 (the actual prompt_tokens in the response).
import sys
import tiktoken
text = sys.argv[1]
enc = tiktoken.encoding_for_model("gpt-4")
tokens = enc.encode(text)
token_count = len(tokens)
print(token_count)
Upvotes: 0
Views: 1719
Reputation: 6538
OpenAI recommends the JavaScript library gpt-3-encoder. In my testing, it gives results similar to their tokenizer playground.
Here is an example of how to use it:
import { encode, decode } from 'gpt-3-encoder'
const str = 'This is an example sentence to try encoding out on!'
const encoded = encode(str)
console.log('Encoded this string looks like: ', encoded)
Which gives the following result:
Encoded this string looks like: [
1212, 318, 281,
1672, 6827, 284,
1949, 21004, 503,
319, 0
]
To see which string each token id corresponds to, decode the tokens one at a time:
for (let token of encoded) {
  console.log({ token, string: decode([token]) })
}
Result:
{ token: 1212, string: 'This' }
{ token: 318, string: ' is' }
{ token: 281, string: ' an' }
{ token: 1672, string: ' example' }
...
Finally, to turn the token ids back into the sentence:
const decoded = decode(encoded) // pass the array of token ids
console.log('We can decode it back into:\n', decoded)
So in your case, you can get the number of tokens from the size of the encoded array: encoded.length
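One caveat on counting chat prompts: encoding only the message text tends to undercount, because the chat endpoint wraps every message in a few framing tokens and primes the reply. Also, gpt-4 uses the cl100k_base vocabulary, while gpt-3-encoder implements the older GPT-2/GPT-3 one, so its counts can differ. Below is a minimal Python sketch of the per-message accounting. The constants (3 per message, 1 per name, 3 for reply priming) are assumptions based on OpenAI's published guidance for gpt-4-0314-style models and may change between model versions. The function names count_chat_tokens and count_with_tiktoken are made up for this illustration.

```python
from typing import Callable, Dict, List, Sequence

# Assumed overhead constants for gpt-4-0314-style chat formatting.
# These are not guaranteed to hold for other model versions.
TOKENS_PER_MESSAGE = 3  # framing tokens added around each message
TOKENS_PER_NAME = 1     # extra token when a message has a "name" field
REPLY_PRIMING = 3       # tokens that prime the assistant's reply


def count_chat_tokens(
    messages: List[Dict[str, str]],
    encode: Callable[[str], Sequence],
) -> int:
    """Estimate prompt tokens for a list of chat messages.

    `encode` is any function that maps a string to a sequence of tokens,
    for example the `encode` method of a tiktoken encoding.
    """
    total = REPLY_PRIMING
    for message in messages:
        total += TOKENS_PER_MESSAGE
        for key, value in message.items():
            # Both the role and the content are encoded and counted.
            total += len(encode(value))
            if key == "name":
                total += TOKENS_PER_NAME
    return total


def count_with_tiktoken(messages: List[Dict[str, str]], model: str = "gpt-4") -> int:
    """Same estimate, using tiktoken's encoding for the given model."""
    import tiktoken  # imported lazily so the helper above has no dependency

    enc = tiktoken.encoding_for_model(model)
    return count_chat_tokens(messages, enc.encode)
```

To try it on the question's input, call count_with_tiktoken with the same two messages (role "system" with 'You are a useful assistant.' and role "user" with 'What is the meaning of life?'). With these constants the total comes to 24 if the two contents encode to 6 and 7 tokens respectively, which is worth confirming locally against the prompt_tokens the API reported.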
Upvotes: -1