Trouble with figuring out what Speech SDK AudioConfig to utilize with the audio/webm;codecs=opus Content-Type

Question

I'm trying to record audio in a frontend using MediaRecorder, then sending it to a controller as a blob, so that it can be transcribed with the Speech SDK. However, it seems that the Speech SDK still doesn't understand it, even when instantiating a PushAudioInputStream or PullAudioInputStream with the corresponding AudioStreamFormat with OGG_OPUS, as described in the Azure documentation for handling compressed input. GStreamer was set up as specified in the documentation, and the POST-request seems to send an OPUS header.

I've tried both reading the blob as a stream, as well as saving it as a file, then reading a stream from the newly saved file, yet the speech recognizer still can't seem to understand it. I can play the saved file locally with no issues, and I've also made sure to use the same binary data as a stream for both the PushAudioInputStream as well as saving it as a file, just to verify that the audio blob gets sent properly, and that I can play it back locally.

As of this example, I'm reading from a file that I save earlier from the blob. I had the same results as described with a reading as stream implementation.

using var customAudioStreamFormat = AudioStreamFormat.GetCompressedFormat(AudioStreamContainerFormat.OGG_OPUS);
SpeechRecognitionResult result;
byte[] debugAudioConfigStream;

using (var audioConfigStream = new PushAudioInputStream(customAudioStreamFormat))
{
    audioConfigStream.Write(File.ReadAllBytes("path of the file saved from reading the blob"));
    audioConfigStream.Close();
    debugAudioConfigStream = File.ReadAllBytes("path of the file saved from reading the blob");
    using (var audioConfig = AudioConfig.FromStreamInput(audioConfigStream))
    {
        using (var speechRecognizer = new SpeechRecognizer(speechConfig, audioConfig))
        {
            result = await speechRecognizer.RecognizeOnceAsync();
        }
    }
//debugAudioConfigStream is later read and written to a new file... and it plays the audio perfectly

If the OGG_OPUS AudioStreamFormat is provided, the result gets cancelled with the reason "ServiceTimeout Timeout: no recognition result received SessionId: ...", and the result's duration is 00:00:00. And if the PushAudioInputStream constructor doesn't receive an AudioStreamFormat, then the result ends up with NoMatch, and a sensible duration instead. These results are reproducible with either input streams or reading from files.

Trying to open the saved files in Audacity, and setting the encoding as discovered by Audacity, or by playing around with a few settings that hit a 128k bitrate(which MediaRecorder uses), will always yield completely garbage audio. Yet despite this, Microsoft's Azure Speech-to-Text demo seems to handle these newly saved "bogus" files completely fine, and can transcribe them accurately. It even accepts an incorrectly saved as .wav file that's completely missing a RIFF header.

As such, I'm assuming that the encoding in my controller is the issue. Not providing an AudioStreamFormat makes the encoding incompatible with how the blob was encoded, as seen with the NoMatch result having an actual duration, yet when using the OGG_OPUS AudioStreamFormat, the recognition request always gets cancelled instantly. The "timeout" happens instantly. And for what it's worth, transcribing a properly formatted .wav file(properly recorded and exported in Audacity, not sent through the frontend) actually works, but ideally, I want to avoid having to save a file to the machine for every speech transcription.

Another thing to note is that audioConfig always throws a System.Application exception from its AudioProcessingOptions property, but I'm still able to transcribe properly formatted .wav files despite this occuring. I could see one solution being to try GetWaveFormatPCM, but I'm not sure what parameters to put in with the blob's encoding, as I had trouble finding out what is actually being used among the myriad compression options utilized in OPUS, or by browsers.

Trouble with figuring out what Speech SDK AudioConfig to utilize with the audio/webm;codecs=opus Content-Type

Answers (1)

Related Questions