Piotr Migdal

Reputation: 12832

Do GPT-4 and GPT-3.5 share the same token encoder?

I want to know if I can use the same token counter for various GPT models - especially GPT-3, GPT-3.5, and GPT-4.

GPT models by OpenAI require text to be tokenized (using Byte Pair Encoding, BPE); see the interactive GPT tokenizer. I haven't found a direct statement on whether they use the same or different tokenizers. Even this official OpenAI page only says:

If you need a programmatic interface for tokenizing text, check out our tiktoken package for Python. For JavaScript, the gpt-3-encoder package for node.js works for most GPT-3 models.

Upvotes: 1

Views: 1430

Answers (2)

Yohan

Reputation: 1263

Here is a code sample that determines which encoding a given model uses. Note that tiktoken.encoding_for_model returns an Encoding object, not a name string, so we read its .name attribute:

import tiktoken

model = "gpt-4o"
# Look up the encoding registered for the specified model
encoding = tiktoken.encoding_for_model(model)

print(f'encoder name for model {model} is: {encoding.name}')

Upvotes: 0

Piotr Migdal

Reputation: 12832

Although I cannot find it stated in the official documentation, GPT-3.5 and GPT-4 seem to share the cl100k_base encoding.

See e.g. gpt-tokenizer.

  • gpt-4-32k (cl100k_base)
  • gpt-4-0314 (cl100k_base)
  • gpt-4-32k-0314 (cl100k_base)
  • gpt-3.5-turbo (cl100k_base)
  • gpt-3.5-turbo-0301 (cl100k_base)

And How to count tokens with tiktoken - OpenAI cookbook.

Upvotes: 2
