Shivansh Yadav

Reputation: 11

How to convert any Audio file to np.ndarray for OpenAI Whisper

I'm trying to make a backend where I can upload an audio file and use OpenAI Whisper to transcribe it, but transcribe() expects an np.ndarray and the uploaded audio arrives as bytes. I'm not sure how to convert bytes to an ndarray.

I'm using Postman to send an audio file to this backend.

import numpy as np
import whisper
from typing import Annotated
from fastapi import FastAPI, File

app = FastAPI()

@app.post("/abcd")
async def transcribe_audio(audio_file_upload: Annotated[bytes, File()]):
    model = whisper.load_model("base")
    result = model.transcribe(audio_file_upload, word_timestamps=True, fp16=True)
    return {"transcription": result}


Error

TypeError: expected np.ndarray (got bytes)

I tried using

audio_data = np.frombuffer(audio_file_upload, dtype=np.float32)

but got

ValueError: Expected parameter logits (Tensor of shape (1, 51865)) of distribution Categorical(logits: torch.Size([1, 51865])) to satisfy the constraint IndependentConstraint(Real(), 1), but found invalid values:
tensor([[nan, nan, nan,  ..., nan, nan, nan]])

I'm a beginner with NumPy and related libraries, so I'm not sure how to implement this.

Upvotes: 1

Views: 1322

Answers (1)

Shivansh Yadav

Reputation: 11

The line below converts the audio bytes to an ndarray. Note that it assumes the bytes are raw 16-bit PCM samples (hence np.int16 and the 32768.0 divisor) at the 16 kHz mono rate Whisper expects:

aud_array = np.frombuffer(audio_file_upload, np.int16).flatten().astype(np.float32) / 32768.0

@app.post("/abcd")
async def transcribe_audio(audio_file_upload: Annotated[bytes, File()]):
    # Interpret the bytes as 16-bit PCM samples and scale to [-1.0, 1.0]
    aud_array = np.frombuffer(audio_file_upload, np.int16).flatten().astype(np.float32) / 32768.0
    model = whisper.load_model("base")
    result = model.transcribe(aud_array, word_timestamps=True, fp16=True)
    return {"transcription": result}

Credit: https://github.com/openai/whisper/discussions/216#discussioncomment-3779531
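If the upload is a complete audio file (WAV, MP3, etc.) rather than raw PCM, np.frombuffer will also read the container header bytes as samples, which is what produced the NaN logits in the question. A more robust alternative is to write the bytes to a temporary file and let Whisper decode it through ffmpeg, which handles headers and resamples to 16 kHz mono for you. A minimal sketch, reusing the /abcd route and imports from the question:

import os
import tempfile
import whisper
from typing import Annotated
from fastapi import FastAPI, File

app = FastAPI()

@app.post("/abcd")
async def transcribe_audio(audio_file_upload: Annotated[bytes, File()]):
    # Persist the upload so Whisper can decode it via ffmpeg.
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
        tmp.write(audio_file_upload)
        tmp_path = tmp.name
    try:
        model = whisper.load_model("base")
        # fp16=False avoids the half-precision warning when running on CPU;
        # set it to True on a GPU.
        result = model.transcribe(tmp_path, word_timestamps=True, fp16=False)
    finally:
        os.remove(tmp_path)
    return {"transcription": result}

If you prefer to keep the ndarray interface, whisper.load_audio(tmp_path) returns the same decoded audio as a 16 kHz float32 ndarray that can be passed to transcribe().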

Upvotes: 0
