Reputation: 12832
I want to know if I can use the same token counter for various GPT models - especially GPT-3, GPT-3.5, and GPT-4.
GPT models by OpenAI need texts to be tokenized (using Byte Pair Encoding, BPE), see Interactive GPT tokenizer. I haven't found a direct statement about whether they use the same or different tokenizers. Even this official OpenAI page says only that:
If you need a programmatic interface for tokenizing text, check out our tiktoken package for Python. For JavaScript, the gpt-3-encoder package for node.js works for most GPT-3 models.
Upvotes: 1
Views: 1430
Reputation: 1263
Here is a code sample that determines which encoding a given model uses. Note that tiktoken.encoding_for_model returns an Encoding object, not a name string, so read its .name attribute:
import tiktoken

model = "gpt-4o"

# Look up the encoding used by the specified model
encoding = tiktoken.encoding_for_model(model)
print(f'encoding for model {model} is: {encoding.name}')
Upvotes: 0
Reputation: 12832
Although I cannot find it in the official documentation, GPT-3.5 and GPT-4 seem to share the cl100k_base encoding.
See e.g. gpt-tokenizer.
gpt-4-32k (cl100k_base)
gpt-4-0314 (cl100k_base)
gpt-4-32k-0314 (cl100k_base)
gpt-3.5-turbo (cl100k_base)
gpt-3.5-turbo-0301 (cl100k_base)
See also How to count tokens with tiktoken in the OpenAI Cookbook.
Upvotes: 2