Khoi Nguyen

Reputation: 421

Azure GPT-4-Turbo JSON mode response generation breaks after 1024 tokens

Overview

Continuing this post.

I'm running data extraction tasks on documents, and I'm trying to take advantage of the 128k context window that gpt-4-turbo offers as well as the JSON-mode setting. I'm hitting a bug where generation breaks at a token length of 1024. As far as I know, the output length limit should be 4096. The symptoms look similar to an issue mentioned in the docs, but the existing solutions/recommendations are insufficient and don't explain this specific observation I'm making.

Arbitrary Code Note

While writing this post, I put together code to reproduce this error with an arbitrary prompt and got an "error message" I had never seen before during my data extraction tasks.

{
    "status": "failed",
    "reason": "The current configuration of the AI model does not support generating a response that exceeds 1024 tokens. Counting to 1500 in JSON would go beyond this token limit and is therefore not possible within a single response."
}

This is NOT a real error message from the endpoint; it's the GPT-generated response from the chat completion. This confuses me even more, but it leads me to believe there's some config setting that's disabled or not working as intended, or, on the off chance, that OpenAI as a whole forgot to enable the 4096 output token length.

Describe the bug

Infinite Stream of Blank Characters

  • When using JSON mode, always instruct the model to produce JSON via some message in the conversation, for example via your system message. If you don't include an explicit instruction to generate JSON, the model may generate an unending stream of whitespace and the request may run continually until it reaches the token limit. To help ensure you don't forget, the API will throw an error if the string "JSON" does not appear somewhere in the context.

My prompts do include the JSON mode instruction, and even so, I observe that when the response generation reaches a token length of 1024, it begins generating an infinite stream of blank characters until content_filter is triggered. Below, I've stripped the blank-space characters to show the token length.

Stream of Whitespace bug

Premature Stop and Malformed JSON Response

  • The JSON in the message the model returns may be partial (i.e. cut off) if finish_reason is length, which indicates the generation exceeded max_tokens or the conversation exceeded the token limit. To guard against this, check finish_reason before parsing the response.

The response returned was a result of the length stop condition. I noted that the token length of the returned response was 1058 tokens; however, on further inspection I saw that the JSON response was malformed (beyond being incomplete), and the malformation began exactly where the token length reached 1024!

Screenshot with malformed response
Screenshot without malformed response
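For reference, here's a minimal non-streaming guard along the lines the docs suggest; it assumes the same client and initial_messages as in the repro code below, and only parses the JSON when the model stopped on its own.

import json

response = client.chat.completions.create(
    model="gpt-4-turbo",
    response_format={"type": "json_object"},
    messages=initial_messages,
    max_tokens=4096,
)

choice = response.choices[0]
if choice.finish_reason == "stop":
    # Safe to parse: the model finished its JSON object.
    data = json.loads(choice.message.content)
else:
    # "length" (or "content_filter") means the JSON is almost certainly truncated.
    print(f"Generation stopped early: finish_reason={choice.finish_reason}")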

Code to Reproduce

Just create some messages and make a chat completion that will likely generate a response with a token length > 1024. Here's some code to get started:

import os

import tiktoken
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2023-05-15",
)

initial_messages = [
    {
        "role": "system",
        "content": "You are an AI Assistant. You will follow the user instruction. You will write a JSON response.",
    },
    {
        "role": "user",
        "content": "I want you to use GPT to generate the counting to 1500. Do not create code. Respond in JSON only. I do not care about reasonableness. You will count to 1500. You will write each number individually, and your output length max is 4096 tokens.",
    },
]
  
response_stream = client.chat.completions.create(
    model="gpt-4-turbo",
    response_format={"type": "json_object"},
    messages=initial_messages,
    max_tokens=4096,
    stream=True,
)
response_from_stream = ""

# Raise once we have seen this many consecutive whitespace-only characters.
blank_space_threshold = 125
consecutive_blank_space_count = 0
for chat_completion_chunk in response_stream:
    if (
        hasattr(chat_completion_chunk, "choices")
        and chat_completion_chunk.choices
    ):
        if (
            hasattr(chat_completion_chunk.choices[0], "delta")
            and chat_completion_chunk.choices[0].delta
            and hasattr(
                chat_completion_chunk.choices[0].delta, "content"
            )
        ):
            content = chat_completion_chunk.choices[0].delta.content
            if content:
                response_from_stream += content

                if not content.isspace():
                    consecutive_blank_space_count = 0
                else:
                    consecutive_blank_space_count += len(content)
                    if (
                        consecutive_blank_space_count
                        > blank_space_threshold
                    ):
                        # cl100k_base is the tokenizer used by gpt-4-turbo;
                        # this stands in for the token-counting helper in my real code.
                        token_length = len(
                            tiktoken.get_encoding("cl100k_base").encode(
                                response_from_stream
                            )
                        )
                        raise Exception(
                            "Encountered blank space bug."
                            + "\n"
                            + "Response from Stream Token Length: "
                            + str(token_length)
                            + "\n"
                            + "Response from Stream: "
                            + response_from_stream
                        )

Troubleshooting

Confirm max_tokens and model deployment

I can confirm this runs, and increasing max_tokens beyond 4096 causes a model error as expected.

response_stream = client.chat.completions.create(
    model="gpt-4-turbo",
    response_format={"type": "json_object"},
    messages=initial_messages,
    max_tokens=4096,
    stream=True,
)

Deployment settings confirmed
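For what it's worth, the over-limit rejection can be reproduced with something like the sketch below (BadRequestError is the openai v1 SDK's exception class; the exact error text comes from the service):

from openai import BadRequestError

try:
    client.chat.completions.create(
        model="gpt-4-turbo",
        response_format={"type": "json_object"},
        messages=initial_messages,
        max_tokens=5000,  # above the 4096 output cap
    )
except BadRequestError as exc:
    print("Rejected as expected:", exc)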

Test random settings and prompts

I tried different prompts, temperature values, etc., and included more and fewer "JSON" instructions. The model never generated more than 1024 tokens; a sketch of the sweep is below.
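(Non-streaming here so that response.usage.completion_tokens reports the generated length; the temperature values are just examples.)

for temperature in (0.0, 0.5, 1.0):
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        response_format={"type": "json_object"},
        messages=initial_messages,
        max_tokens=4096,
        temperature=temperature,
    )
    print(
        temperature,
        response.choices[0].finish_reason,
        response.usage.completion_tokens,  # never exceeded ~1024 in my runs
    )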

Upvotes: 1

Views: 1054

Answers (1)

JayashankarGS

Reputation: 7985

I tried the 2023-12-01-preview API version without streaming and got around 3100 tokens. With API version 2024-02-15-preview, I got around 4100 tokens.

client = AzureOpenAI(
    azure_endpoint="https://name.openai.azure.com/",
    api_key="bbxxxxxxxxxxxxxxxxxxxxxxxxxx",
    api_version="2024-02-15-preview",
)

initial_messages = [
    {
        "role": "system",
        "content": "You are an AI Assistant. You will follow the user instruction. You will write a JSON response.",
    },
    {
        "role": "user",
        "content": "I want you to use GPT to generate the counting to 1500. Do not create code. Respond in JSON only. I do not care about reasonableness. You will count to 1500. You will write each number individually, and your output length max is 4096 tokens.",
    },
]
  
response = client.chat.completions.create(
    model="gpt4",
    response_format={"type": "json_object"},
    messages=initial_messages,
    max_tokens=4096,
)


If you observe the finish_reason, it is length. That means the model has given partial output because of the length limit.


If you are getting finish_reason as stop, it means the model has given the full output. According to this documentation:

there's no guarantee for output to match a specific schema, even if requested in the prompt.

Try using the latest version of the API and follow this documentation for Managing conversations within the token limit.
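A rough sketch of that token-management pattern with tiktoken is below; the per-message overhead constants are approximate and vary slightly by model, so treat this as an estimate rather than the exact accounting.

import tiktoken

def num_tokens_from_messages(messages, encoding_name="cl100k_base"):
    # Approximate the prompt size so prompt + max_tokens stays under the model's limit.
    encoding = tiktoken.get_encoding(encoding_name)
    tokens = 0
    for message in messages:
        tokens += 4  # rough per-message overhead for the chat format
        for value in message.values():
            tokens += len(encoding.encode(value))
    return tokens + 2  # priming tokens for the assistant's reply

prompt_tokens = num_tokens_from_messages(initial_messages)
print(prompt_tokens, "prompt tokens; leave room for up to 4096 completion tokens")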

Upvotes: 0
