sc0urge

Reputation: 51

OpenAI Whisper API: "Invalid file format" on AWS Lambda

I am trying to transcribe the user's voice, which the client sends as WebM bytes with the Opus codec. Locally, I have no issues saving the bytes and reading them back as a file to pass to the API. When I run the same code on Lambda, I get this error:

openai.BadRequestError: Error code: 400 - {'error': {'message': "Invalid file format. Supported formats: ['flac', 'm4a', 'mp3', 'mp4', 'mpeg', 'mpga', 'oga', 'ogg', 'wav', 'webm']", 'type': 'invalid_request_error', 'param': None, 'code': None}}.

This is my code:

@api_bp.post("/audio")
def transcribeRoute():
    audio_file = request.files.get('audio', None)

    if audio_file:
        print(audio_file) # <FileStorage: 'blob' ('audio/webm;codecs=opus')>


        old_work_dir = os.getcwd()

        with tempfile.TemporaryDirectory() as tmp_dir:
            os.chdir(tmp_dir)

            try:
                input_file_path = 'recording.webm'
                audio_file.save(input_file_path)

                # Reopen the saved file in binary mode and close it again before cleanup
                with open(input_file_path, "rb") as saved_file:
                    print(saved_file) # <_io.BufferedReader name='recording.webm'>
                    transcript = client.audio.transcriptions.create(model="whisper-1", file=saved_file).text

            finally:
                os.remove(input_file_path)
                os.chdir(old_work_dir)

        print(f"Transcribed audio: {transcript}")
        return {"transcription": transcript}, 200

    print("No audio file found")
    return {"transcription": "error"}, 400

My first thought was that this was an issue with Lambda's temporary storage. However, /tmp seems to work and has 500 MB of storage. Printing the file also confirms that it is saved and can be read.

Any help is greatly appreciated!

Edit 1: I now pass the raw bytes directly instead of saving them first, to rule out an error in the filesystem.

@api_bp.post("/audio")
def transcribeRoute():
    audio_file = request.files.get('audio', None)

    if audio_file:
        print(audio_file) # <FileStorage: 'blob' ('audio/webm;codecs=opus')>

        buffer = BytesIO(audio_file.read())
        buffer.name = "test.webm"

        transcript = client.audio.transcriptions.create(model="whisper-1", file=buffer).text

        print(f"Transcribed audio: {transcript}")
        return {"transcription": transcript}, 200

    print("No audio file found")
    return {"transcription": "error"}, 400
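The buffer.name line above is what gives the upload a filename, and the extension of that name appears to be how the format is recognised. A minimal sketch of deriving the name from the upload's content type instead of hard-coding it; the EXTENSIONS mapping and the to_named_buffer helper are hypothetical names introduced for illustration, not part of Flask or the OpenAI client:

```python
from io import BytesIO

# Hypothetical mapping from MIME type to an extension the API lists as supported
EXTENSIONS = {
    "audio/webm": "webm",
    "audio/ogg": "ogg",
    "audio/wav": "wav",
    "audio/mpeg": "mp3",
    "audio/mp4": "m4a",
    "audio/flac": "flac",
}


def to_named_buffer(data: bytes, content_type: str, stem: str = "recording") -> BytesIO:
    """Wrap raw audio bytes in a BytesIO whose name carries a matching extension."""
    # Drop parameters such as ";codecs=opus" before the lookup
    mime = content_type.split(";", 1)[0].strip().lower()
    if mime not in EXTENSIONS:
        raise ValueError(f"Unsupported audio type: {content_type!r}")
    buffer = BytesIO(data)
    buffer.name = f"{stem}.{EXTENSIONS[mime]}"
    return buffer
```

In the route this would be called as to_named_buffer(audio_file.read(), audio_file.content_type), and the result passed as the file argument.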

I am still investigating this thread, which made me think that this may be caused by a dependency on the host machine. To find out, I redeployed the code to Lambda using a Lambda Docker image, which threw the same error. Running the program locally in the official python:3.9 image works without problems. Deploying it to Lambda again brings the errors back.

My current theory is that the binary data somehow gets corrupted when it is sent through API Gateway, because reading that data and converting it to another format also fails (only on Lambda; locally there is no problem). This is the output from ffmpeg/avlib:

2023-12-06T15:14:21.776+01:00   ffmpeg version 5.1.4-0+deb12u1 Copyright (c) 2000-2023 the FFmpeg developers
built with gcc 12 (Debian 12.2.0-14)
configuration: --prefix=/usr --extra-version=0+deb12u1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libdav1d --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libglslang --enable-libgme --enable-libgsm --enable-libjack --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librabbitmq --enable-librist --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libsrt --enable-libssh --enable-libsvtav1 --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvorbis --enable-libvpx --enable-libwebp --enable-libx265 --enable-libxml2 --enable-libxvid --enable-libzimg --enable-libzmq --enable-libzvbi --enable-lv2 --enable-omx --enable-openal --enable-opencl --enable-opengl --enable-sdl2 --disable-sndio --enable-libjxl --enable-pocketsphinx --enable-librsvg --enable-libmfx --enable-libdc1394 --enable-libdrm --enable-libiec61883 --enable-chromaprint --enable-frei0r --enable-libx264 --enable-libplacebo --enable-librav1e --enable-shared

libavutil 57. 28.100 / 57. 28.100
libavcodec 59. 37.100 / 59. 37.100
libavformat 59. 27.100 / 59. 27.100
libavdevice 59. 7.100 / 59. 7.100
libavfilter 8. 44.100 / 8. 44.100
libswscale 6. 7.100 / 6. 7.100
libswresample 4. 7.100 / 4. 7.100
libpostproc 56. 6.100 / 56. 6.100
[matroska,webm @ 0x55802d5b9e00] Element at 0x44 ending at 0x83 exceeds containing master element ending at 0x74
[matroska,webm @ 0x55802d5b9e00] EBML header parsing failed
pipe:: Invalid data found when processing input
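One way to test the corruption theory is to log a fingerprint of the received bytes in both environments and compare it with the same fingerprint taken where the recording was made. The sketch below is only a diagnostic aid. describe_payload and simulate_lossy_text_roundtrip are hypothetical helpers, and the simulated failure mode (binary bytes decoded as UTF-8 text and re-encoded) is an assumption about what might happen in transit, not something confirmed by the logs above:

```python
import hashlib

# UTF-8 encoding of U+FFFD, the character substituted for undecodable bytes
REPLACEMENT = b"\xef\xbf\xbd"


def describe_payload(data: bytes) -> dict:
    """Summarise a payload so the local and Lambda copies can be compared."""
    return {
        "size": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),
        "head_hex": data[:16].hex(),
        # Many occurrences hint at a lossy text conversion somewhere upstream
        "replacement_chars": data.count(REPLACEMENT),
    }


def simulate_lossy_text_roundtrip(data: bytes) -> bytes:
    """Assumed failure mode: bytes treated as UTF-8 text and re-encoded."""
    return data.decode("utf-8", errors="replace").encode("utf-8")
```

If the size or hash logged on Lambda differs from the local one for the same recording, the bytes were changed before they reached the route, which points away from the filesystem and the OpenAI client.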

Upvotes: 0

Views: 1491

Answers (1)

sc0urge

Reputation: 51

After a lot of trial and research, the problem turned out to be a mix of two issues:
a) For the Whisper API to work, the buffer holding the audio bytes has to have a name. This happens automatically when you write the bytes to a file and read them back; just make sure the name has the right extension.
b) AWS API Gateway doesn't support binary data in requests by default, so you have to enable it explicitly. For example, if you deploy with the Serverless Framework, add this:

provider:
  apiGateway:
    binaryMediaTypes:
      - '*/*'
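With binary media types enabled, API Gateway's Lambda proxy integration hands binary request bodies to the function as base64 text and sets isBase64Encoded on the event. WSGI adapters used to run Flask on Lambda typically decode this before the route sees the request, so the following is only a sketch for a bare handler or for debugging. body_bytes is a hypothetical helper name, and for a multipart upload the decoded result is the whole form body rather than just the audio:

```python
import base64


def body_bytes(event: dict) -> bytes:
    """Return the raw request body from an API Gateway proxy event."""
    body = event.get("body") or ""
    if event.get("isBase64Encoded"):
        # Binary payloads arrive base64-encoded once binary media types are enabled
        return base64.b64decode(body)
    # Text payloads arrive as a plain string
    return body.encode("utf-8")
```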

Upvotes: 1
