Reputation: 421
Continuing this post.
I'm running data extraction tasks on documents, and I'm trying to take advantage of the 128k context window that gpt-4-turbo offers, as well as the json-mode setting. I'm experiencing a bug where generation breaks at a token length of 1024. AFAIK, the output length limit should be 4096 tokens. The behavior looks similar to an issue the docs mention, but the existing solutions/recommendations are insufficient and don't explain the specific observation I'm making.
While writing this post, I was putting together code to reproduce this error on demand, and I got an "error message" I had never seen before during my data extraction tasks.
{
"status": "failed",
"reason": "The current configuration of the AI model does not support generating a response that exceeds 1024 tokens. Counting to 1500 in JSON would go beyond this token limit and is therefore not possible within a single response."
}
This is NOT a real error message from the endpoint. It's a GPT-generated response from the chat completion. This confuses me even more, but it leads me to believe there's some config setting that's disabled or not working as intended, or, on the off chance, that OpenAI as a whole forgot to enable the 4096 output token length.
- When using JSON mode, always instruct the model to produce JSON via some message in the conversation, for example via your system message. If you don't include an explicit instruction to generate JSON, the model may generate an unending stream of whitespace and the request may run continually until it reaches the token limit. To help ensure you don't forget, the API will throw an error if the string "JSON" does not appear somewhere in the context.
My prompts do include the JSON mode instruction, and even then, I observe that when the response generation reaches a token length of 1024, it begins generating an infinite stream of blank characters until content_filter is triggered. Below, I've stripped the blank space characters to indicate the token length.
- The JSON in the message the model returns may be partial (i.e. cut off) if finish_reason is length, which indicates the generation exceeded max_tokens or the conversation exceeded the token limit. To guard against this, check finish_reason before parsing the response.
The response returned was a result of the length stop condition. I noted that the token length of the returned response was 1058 tokens; however, on further inspection I saw that the JSON response was malformed (beyond being incomplete), and the malformed output began exactly where the token length reached 1024!
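For reference, this is roughly how the cut-over point can be located. The snippet below is only a sketch, not part of my pipeline: it assumes tiktoken with the cl100k_base encoding, and response_text stands in for the accumulated streamed response.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by gpt-4-turbo
tokens = enc.encode(response_text)          # response_text: the returned JSON string

print(len(tokens))                # 1058 in my case
print(enc.decode(tokens[:1024]))  # well-formed JSON up to token 1024
print(enc.decode(tokens[1024:]))  # the malformed tail starts here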
Just create some messages and make a chat completion that will likely generate a response with token_length > 1024. Here's some code below to get started:
import os
import tiktoken
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2023-05-15",
)

initial_messages = [
    {
        "role": "system",
        "content": "You are an AI Assistant. You will follow the user instruction. You will write a JSON response.",
    },
    {
        "role": "user",
        "content": "I want you to use GPT to generate the counting to 1500. Do not create code. Respond in JSON only. I do not care about reasonableness. You will count to 1500. You will write each number individually, and your output length max is 4096 tokens.",
    },
]

response_stream = client.chat.completions.create(
    model="gpt-4-turbo",
    response_format={"type": "json_object"},
    messages=initial_messages,
    max_tokens=4096,
    stream=True,
)


def num_tokens_from_response(text: str) -> int:
    # Token counter for the streamed text (cl100k_base is the encoding used by gpt-4-turbo).
    return len(tiktoken.get_encoding("cl100k_base").encode(text))


response_from_stream = ""
blank_space_threshold = 125
consecutive_blank_space_count = 0

for chat_completion_chunk in response_stream:
    # Azure chunks can arrive with an empty choices list, so guard before indexing.
    if not chat_completion_chunk.choices:
        continue
    delta = chat_completion_chunk.choices[0].delta
    content = delta.content if delta is not None else None
    if not content:
        continue
    response_from_stream += content
    if not content.isspace():
        consecutive_blank_space_count = 0
    else:
        consecutive_blank_space_count += len(content)
    if consecutive_blank_space_count > blank_space_threshold:
        # Once the stream degenerates into blank space, stop and report how far it got.
        raise Exception(
            "Encountered blank space bug."
            + "\n"
            + "Response from Stream Token Length: "
            + str(num_tokens_from_response(response_from_stream))
            + "\n"
            + "Response from Stream: "
            + response_from_stream
        )
Regarding max_tokens and the model deployment: I can confirm this runs, and increasing max_tokens beyond 4096 will cause a model error as expected.
response_stream = client.chat.completions.create(
    model="gpt-4-turbo",
    response_format={"type": "json_object"},
    messages=initial_messages,
    max_tokens=4096,
    stream=True,
)
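For completeness, a minimal sketch of the failing case; the exact error text is whatever the API returns, and the except clause assumes the SDK's 400-level error class:
from openai import BadRequestError

try:
    client.chat.completions.create(
        model="gpt-4-turbo",
        response_format={"type": "json_object"},
        messages=initial_messages,
        max_tokens=5000,  # above the 4096 completion limit
        stream=True,
    )
except BadRequestError as e:
    # The request is rejected up front when max_tokens exceeds the model's output limit.
    print(e)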
I tried different prompts, temperature values, etc., and included more and fewer "JSON" instructions. It never generated more than 1024 tokens.
Upvotes: 1
Views: 1054
Reputation: 7985
I have tried with the 2023-12-01-preview API version without streaming, and I got around 3100 tokens. With API version 2024-02-15-preview, I got around 4100 tokens.
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://name.openai.azure.com/",
    api_key="bbxxxxxxxxxxxxxxxxxxxxxxxxxx",
    api_version="2024-02-15-preview",
)

initial_messages = [
    {
        "role": "system",
        "content": "You are an AI Assistant. You will follow the user instruction. You will write a JSON response.",
    },
    {
        "role": "user",
        "content": "I want you to use GPT to generate the counting to 1500. Do not create code. Respond in JSON only. I do not care about reasonableness. You will count to 1500. You will write each number individually, and your output length max is 4096 tokens.",
    },
]

# No stream=True here, so this returns a regular ChatCompletion object.
response_stream = client.chat.completions.create(
    model="gpt4",
    response_format={"type": "json_object"},
    messages=initial_messages,
    max_tokens=4096,
)
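To confirm how many tokens actually came back, you can read the usage block on the response; a small sketch assuming the call above:
# usage reports the completion token count for a non-streaming chat completion.
print(response_stream.usage.completion_tokens)  # around 4100 with 2024-02-15-preview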
If you observe the finish_reason, it is length. That means the model has given partial output because of the length limit. If you are getting finish_reason as stop, it means the model has given the full output. According to this documentation:
there's no guarantee for output to match a specific schema, even if requested in the prompt.
Try using the latest version of the API and follow this documentation for Managing conversations within the token limit.
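As an illustration, here is a minimal way to check finish_reason before parsing, assuming the non-streaming call above:
import json

choice = response_stream.choices[0]
if choice.finish_reason == "length":
    # The output hit max_tokens (or the context limit); the JSON may be truncated.
    print("Partial output; the JSON may not parse.")
else:
    data = json.loads(choice.message.content)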
Upvotes: 0