Pavan Kumar

Reputation: 115

How to ensure the OpenAI Realtime API provides strict, literal language translations without extra details

I am currently working on language translation between two callers using Twilio and the OpenAI Realtime API. I fetch the audio stream from Twilio and push it to the OpenAI WebSocket as shown below.

const audioAppend = {
  type: "input_audio_buffer.append",
  audio: data.media.payload,
};

if (
  client.callerOpenAiSocket != null &&
  client.callerOpenAiSocket.readyState === WebSocket.OPEN
) {
  client.callerOpenAiSocket.send(JSON.stringify(audioAppend));
} else {
  // console.log("Please wait until OpenAI is initialized");
}
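Note that the else branch above silently drops any audio that arrives before the OpenAI socket is open. A minimal sketch of one alternative, queueing chunks and flushing them in order once the socket is ready, could look like this. The names createAudioForwarder and maxQueued are hypothetical and not part of the Twilio or OpenAI APIs; the cap of 50 chunks is an arbitrary assumption.

```javascript
// Sketch only: createAudioForwarder and maxQueued are hypothetical names.
const WS_OPEN = 1; // numeric value of WebSocket.OPEN

function createAudioForwarder(getSocket, maxQueued = 50) {
  const pending = []; // base64 payloads waiting for the socket to open

  function sendNow(socket, payload) {
    socket.send(
      JSON.stringify({ type: "input_audio_buffer.append", audio: payload })
    );
  }

  return {
    // Call once per Twilio media message; returns true if the chunk was sent.
    push(payload) {
      const socket = getSocket();
      if (socket != null && socket.readyState === WS_OPEN) {
        // Flush anything queued first so the audio stays in order.
        while (pending.length > 0) sendNow(socket, pending.shift());
        sendNow(socket, payload);
        return true;
      }
      pending.push(payload);
      // Once over the cap, drop the oldest chunk to bound memory and delay.
      if (pending.length > maxQueued) pending.shift();
      return false;
    },
    // Number of chunks currently waiting.
    queued() {
      return pending.length;
    },
  };
}
```

In the setup above this would be used roughly as `forwarder.push(data.media.payload)` with `createAudioForwarder(() => client.callerOpenAiSocket)`. Whether queueing is preferable to dropping depends on how much start-of-call audio matters.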

On the OpenAI socket, this is how I send the session update:

this.callersessionUpdate = {
  type: "session.update",
  session: {
    turn_detection: {
      type: "server_vad",
      threshold: 0.5,
      prefix_padding_ms: 300,
      silence_duration_ms: 500,
    },
    input_audio_format: "g711_ulaw",
    output_audio_format: "g711_ulaw",
    voice: this.voice,
    instructions: this.callerPrompt,
    modalities: ["text", "audio"],
    temperature: 0.8,
    max_response_output_tokens: 100,
    input_audio_transcription: {
      model: "whisper-1",
    },
  },
};
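If the source language is known in advance, one thing that may be worth trying is passing a language hint to the transcription model. The sketch below rebuilds the same session update through a hypothetical helper, buildTranslationSession, and adds a `language` field under `input_audio_transcription`. That field is an assumption here: it should be checked against the current Realtime API reference before relying on it. "te" is the ISO-639-1 code for Telugu. All other values are copied from the session update above.

```javascript
// Sketch only: buildTranslationSession is a hypothetical helper.
function buildTranslationSession({
  voice,
  instructions,
  sourceLanguage,
  temperature = 0.8, // same value as the original session update
}) {
  const transcription = { model: "whisper-1" };
  if (sourceLanguage) {
    // Assumption: the API accepts an ISO-639-1 hint in this position.
    transcription.language = sourceLanguage;
  }

  return {
    type: "session.update",
    session: {
      turn_detection: {
        type: "server_vad",
        threshold: 0.5,
        prefix_padding_ms: 300,
        silence_duration_ms: 500,
      },
      input_audio_format: "g711_ulaw",
      output_audio_format: "g711_ulaw",
      voice,
      instructions,
      modalities: ["text", "audio"],
      temperature,
      max_response_output_tokens: 100,
      input_audio_transcription: transcription,
    },
  };
}
```

The result would be sent the same way as before, for example `socket.send(JSON.stringify(buildTranslationSession({ voice, instructions, sourceLanguage: "te" })))`.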

The prompt I use for the translation is as follows:

You are an AI assistant designed to process Telugu audio. Please perform the following tasks accurately and concisely:

  1. Task: Listen to the provided Telugu audio and transcribe it into written Telugu text.
  2. Translate: Translate the transcribed Telugu text into English.
  3. Output: Provide English translation clearly.

Do not include any additional information, context, or explanations. Ensure that all responses are complete and clear.

The issues I am facing are:

  1. The response from OpenAI is delayed.
  2. During the conversation between the callers, instead of translating the audio, the model sometimes switches into conversational mode and answers the speaker, which causes a lot of confusion.
  3. Although I specify the source language in the prompt, the audio is sometimes transcribed into other languages. This does not happen every time.
  4. In my scenario, should I expect to receive the two events below? At the moment I sometimes do not receive them, and I suspect this could be one of the causes, but I am not sure: "conversation.item.input_audio_transcription.completed" and "response.audio_transcript.done".
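To check point 4, a small tracker like the sketch below could record which events actually arrive on the socket. createTranscriptTracker is a hypothetical name. It assumes each message is a JSON string and that both events carry a `transcript` field.

```javascript
// Sketch only: createTranscriptTracker is a hypothetical helper.
const INPUT_DONE = "conversation.item.input_audio_transcription.completed";
const OUTPUT_DONE = "response.audio_transcript.done";

function createTranscriptTracker() {
  const input = []; // transcripts of what the caller said
  const output = []; // transcripts of what the model spoke
  const otherCounts = {}; // every other event type, counted

  return {
    // Feed one raw message; returns the event type, or null if unparseable.
    handle(raw) {
      let event;
      try {
        event = JSON.parse(raw);
      } catch {
        return null;
      }
      if (event === null || typeof event !== "object") return null;

      if (event.type === INPUT_DONE) {
        input.push(event.transcript);
      } else if (event.type === OUTPUT_DONE) {
        output.push(event.transcript);
      } else {
        otherCounts[event.type] = (otherCounts[event.type] || 0) + 1;
      }
      return event.type;
    },
    summary() {
      return { input: [...input], output: [...output], otherCounts: { ...otherCounts } };
    },
  };
}
```

With the `ws` package (an assumption about the client library in use), this could be wired up as `socket.on("message", (data) => tracker.handle(data.toString()))`, and `tracker.summary()` logged at the end of a call to see whether either event is missing.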

NOTE: I’m sending the session updates for 3 seconds. Can anyone guide me in addressing these issues?

Upvotes: 1

Views: 95

Answers (0)
