Reputation: 11
I'm trying to build a backend where I can upload an audio file and use Whisper AI to transcribe it, but transcribe accepts type np.ndarray and the uploaded audio file arrives as bytes, so I'm not sure how to convert bytes -> ndarray.
I'm using Postman to send an audio file to this backend, but I need to convert the bytes to an ndarray before I can call Whisper's transcribe method, and I'm not sure how to do that.
import numpy as np
import whisper
from typing import Annotated
from fastapi import FastAPI, File

app = FastAPI()

@app.post("/abcd")
async def transcribe_audio(audio_file_upload: Annotated[bytes, File()]):
    model = whisper.load_model("base")
    result = model.transcribe(audio_file_upload, word_timestamps=True, fp16=True)
    return {"transcription": result}
Error
TypeError: expected np.ndarray (got bytes)
I tried using
audio_data = np.frombuffer(audio_file_upload, dtype=np.float32)
but got
ValueError: Expected parameter logits (Tensor of shape (1, 51865)) of distribution Categorical(logits: torch.Size([1, 51865])) to satisfy the constraint IndependentConstraint(Real(), 1), but found invalid values:
tensor([[nan, nan, nan, ..., nan, nan, nan]])
but I'm a beginner with NumPy, so I'm not sure how to implement this correctly.
Upvotes: 1
Views: 1322
Reputation: 11
The line below converts the audio bytes to an ndarray. Note the np.int16 dtype, which matches the / 32768.0 normalization for 16-bit PCM samples (as in the linked discussion):
aud_array = np.frombuffer(audio_file_upload, np.int16).flatten().astype(np.float32) / 32768.0
@app.post("/abcd")
async def transcribe_audio(audio_file_upload: Annotated[bytes, File()]):
    # Interpret the uploaded bytes as 16-bit PCM samples and normalize to float32 in [-1.0, 1.0]
    aud_array = np.frombuffer(audio_file_upload, np.int16).flatten().astype(np.float32) / 32768.0
    model = whisper.load_model("base")
    result = model.transcribe(aud_array, word_timestamps=True, fp16=True)
    return {"transcription": result}
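For context, this conversion assumes the upload is raw (headerless) 16-bit PCM mono audio at the 16 kHz sample rate Whisper expects: np.frombuffer reinterprets the bytes as int16 samples, and dividing by 32768.0 scales them into the [-1.0, 1.0] float32 range. A minimal standalone sketch of the same conversion, using made-up sample values purely for illustration:
import numpy as np

# Three made-up 16-bit PCM sample values, serialized to bytes the way an upload would arrive
pcm_bytes = np.array([0, 16384, -32768], dtype=np.int16).tobytes()

# Reinterpret the bytes as int16 samples and normalize to float32 in [-1.0, 1.0]
aud_array = np.frombuffer(pcm_bytes, np.int16).flatten().astype(np.float32) / 32768.0

print(aud_array)  # [ 0.   0.5 -1. ]
If the upload is a container or compressed format such as WAV or MP3, the bytes also contain header or encoded data, so you would typically decode it first, for example by writing it to a temporary file and passing the path to model.transcribe or whisper's ffmpeg-based whisper.load_audio.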
Credit: https://github.com/openai/whisper/discussions/216#discussioncomment-3779531
Upvotes: 0