Reputation: 41
Imagine you are calling a friend through your computer: you speak into a microphone, and you hear your friend through your speakers.
I want to transcribe the conversation in real time. Since I don't know which platform the call uses, I can't rely on anything platform-specific. For this reason, the best way to approach the problem seems to be transcribing both the microphone input and the audio coming out of the speakers.
Unfortunately, I cannot figure out how to do both at once.
I want real-time (or close to real-time; a 5-second delay is okay) transcription of a call between someone using a laptop and a person on the other side. I am using the Azure AI Speech service to transcribe ongoing audio streams; it can also transcribe from a file.
For the person on our side, I can use the microphone input audio; for the person on the other side, the speaker output audio. The only problem is that I don't know how to combine the two.
The code below works perfectly for transcribing from the microphone in real time. I just don't know how to add in the other half of the conversation from the speaker audio.
def transcribe_from_microphone():
I found this:
import soundcard as sc
Upvotes: 1
Views: 1098
Reputation: 41
This is a "sorta" answer, still not perfect.
Turns out, you can specify an input device for the audio config and use that for the transcriber: you need the "audio device endpoint ID string [...] from the IMMDevice object".
This took a while to find, but I came across code (credit to @Damien) that finds exactly that string:
Unfortunately, I don't seem to be able to use my headset speaker with Azure (I just don't get any transcriptions), but I am able to use the Stereo Mix, which after a lot of fiddling does let me transcribe the speaker output. The transcription is also very slow and inaccurate.
I am attaching relevant code for reference:
Note: this is an altered version of my code, so it is possible I accidentally removed something that is in fact necessary, but the main idea is there. Feel free to ask questions if something doesn't work.
I didn't have to use the soundcard library or create any custom audio stream classes (which I found very painful). I'm going to open another question about the slowness, though I suspect it is due to the threads rather than the Azure service itself.
Upvotes: 1
Reputation: 3332
First, capture the microphone input and the speaker output simultaneously, and stream both audio inputs to the Azure Speech service for transcription.
Then combine the transcriptions from the two sources into a unified output.
Use the soundcard library to capture audio from the microphone and the speaker, and use Azure's Speech SDK to create two separate recognizers: one for the microphone input and one for the speaker output.
App.py:
import threading
import soundcard as sc
import azure.cognitiveservices.speech as speechsdk
import queue
import warnings

# Suppress the SoundcardRuntimeWarning
warnings.filterwarnings("ignore", category=sc.SoundcardRuntimeWarning)

# Your Azure subscription key and region
audio_key = 'YOUR_AZURE_SUBSCRIPTION_KEY'
audio_region = 'YOUR_AZURE_REGION'

# Queues for passing audio data between threads
mic_queue = queue.Queue()
speaker_queue = queue.Queue()

# Function to capture microphone audio
def capture_mic_audio(mic_queue):
    mic = sc.get_microphone(id=str(sc.default_microphone().name))
    with mic.recorder(samplerate=48000) as mic_recorder:
        while True:
            data = mic_recorder.record(numframes=1024)
            mic_queue.put(data)

# Function to capture speaker (loopback) audio
def capture_speaker_audio(speaker_queue):
    speaker = sc.get_microphone(id=str(sc.default_speaker().name), include_loopback=True)
    with speaker.recorder(samplerate=48000) as speaker_recorder:
        while True:
            data = speaker_recorder.record(numframes=1024)
            speaker_queue.put(data)

# Function to create an audio input stream for the Azure Speech SDK
def create_audio_input_stream(audio_queue):
    class AudioInputStream(speechsdk.audio.PullAudioInputStreamCallback):
        def __init__(self):
            super().__init__()

        def read(self, buffer, size):
            try:
                data = audio_queue.get(block=False)
                buffer[:len(data)] = data
                return len(data)
            except queue.Empty:
                return 0

        def close(self):
            pass

    return speechsdk.audio.AudioConfig(stream=speechsdk.audio.PullAudioInputStream(AudioInputStream()))

# Function to start Azure speech recognition
def start_recognition(audio_config, speech_config, name):
    recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)

    def recognized(evt):
        if evt.result.reason == speechsdk.ResultReason.RecognizedSpeech:
            print(f"{name} recognized: {evt.result.text}")
        elif evt.result.reason == speechsdk.ResultReason.NoMatch:
            print(f"{name} recognized: No speech could be recognized")
        elif evt.result.reason == speechsdk.ResultReason.Canceled:
            cancellation_details = evt.result.cancellation_details
            print(f"{name} recognized: Canceled: {cancellation_details.reason}")
            if cancellation_details.reason == speechsdk.CancellationReason.Error:
                print(f"Error details: {cancellation_details.error_details}")

    recognizer.recognized.connect(recognized)
    recognizer.start_continuous_recognition_async()
    return recognizer

# Main function
def main():
    speech_config = speechsdk.SpeechConfig(subscription=audio_key, region=audio_region)
    speech_config.speech_recognition_language = "en-US"

    # Start capturing audio
    threading.Thread(target=capture_mic_audio, args=(mic_queue,), daemon=True).start()
    threading.Thread(target=capture_speaker_audio, args=(speaker_queue,), daemon=True).start()

    # Create audio input streams
    mic_audio_input = create_audio_input_stream(mic_queue)
    speaker_audio_input = create_audio_input_stream(speaker_queue)

    # Start speech recognizers
    mic_recognizer = start_recognition(mic_audio_input, speech_config, "Microphone")
    speaker_recognizer = start_recognition(speaker_audio_input, speech_config, "Speaker")

    # Keep the program running to process transcriptions
    try:
        while True:
            pass
    except KeyboardInterrupt:
        mic_recognizer.stop_continuous_recognition_async()
        speaker_recognizer.stop_continuous_recognition_async()

if __name__ == "__main__":
    main()
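One caveat about the read() callback in the App.py listing: in the Speech SDK's pull-stream model, a return value of 0 is documented to mean end of stream, so returning 0 whenever the queue is momentarily empty risks ending recognition early. A standalone sketch of a blocking variant (no Azure imports; BlockingQueueReader and the None sentinel are illustrative names, not SDK API):

```python
import queue

class BlockingQueueReader:
    """Pull-style read() that waits for data instead of returning 0
    when the queue is momentarily empty.  Returning 0 is reserved for
    the genuine end of the stream, signalled here by a sentinel."""
    def __init__(self, audio_queue, sentinel=None):
        self.audio_queue = audio_queue
        self.sentinel = sentinel  # producer puts this to end the stream

    def read(self, buffer: memoryview) -> int:
        data = self.audio_queue.get(block=True)   # wait for audio
        if data is self.sentinel:
            return 0                              # genuine end of stream
        n = min(len(buffer), len(data))
        buffer[:n] = data[:n]
        return n                                  # bytes actually written

q = queue.Queue()
q.put(b"\x01\x02\x03\x04")
q.put(None)                   # sentinel: end of stream
reader = BlockingQueueReader(q)
buf = bytearray(8)
print(reader.read(memoryview(buf)))   # 4
print(reader.read(memoryview(buf)))   # 0
```

On shutdown, the producer thread puts the sentinel into the queue so read() returns 0 exactly once, at the real end of the stream, rather than every time capture briefly falls behind.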
As you can see above, I have used two separate threads to capture audio from the microphone and the speaker; each thread captures audio continuously and puts it into a queue. I also created a custom PullAudioInputStreamCallback to feed audio data from the queues to the Azure Speech SDK.
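The two recognizers print each side's text independently; to get the unified transcription mentioned at the start, one simple option (a sketch using hypothetical (timestamp, text) events, not part of the original answer) is to record the arrival time of each recognized event and sort the labeled events chronologically:

```python
def merge_transcripts(mic_events, speaker_events):
    """Merge two lists of (timestamp, text) events into one
    chronologically ordered transcript with speaker labels."""
    labeled = (
        [(t, "Microphone", txt) for t, txt in mic_events]
        + [(t, "Speaker", txt) for t, txt in speaker_events]
    )
    return [f"[{spk}] {txt}" for t, spk, txt in sorted(labeled)]

# Example with made-up timestamps (seconds since call start):
mic = [(0.5, "Hi, can you hear me?"), (6.2, "Great, let's start.")]
spk = [(2.8, "Yes, loud and clear.")]
for line in merge_transcripts(mic, spk):
    print(line)   # three lines, in chronological order
```

In the live code, the timestamps would come from time.monotonic() captured inside each recognized() callback before the event is appended to a shared list.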
Result:
Modified:
Converting the NumPy array to bytes, and returning the number of bytes actually written (len(data) counts frames, not bytes, so the slice lengths would not match otherwise):
def read(self, buffer: memoryview) -> int:
    try:
        data = audio_queue.get(block=False)
        audio_bytes = data.tobytes()  # convert the NumPy frames to raw bytes
        n = min(len(buffer), len(audio_bytes))
        buffer[:n] = audio_bytes[:n]
        return n
    except queue.Empty:
        return 0
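On the slowness and inaccuracy: soundcard's recorders deliver float frames (here 48 kHz, typically two channels), while an Azure pull stream defaults to 16 kHz, 16-bit, mono PCM unless you pass an explicit speechsdk.audio.AudioStreamFormat(samples_per_second=..., bits_per_sample=..., channels=...) when creating it. Feeding raw float frames in the default format is a plausible cause of the poor results. A sketch of converting a captured chunk before it is queued (to_pcm16_mono_16k is my own name, and the naive decimation skips the low-pass filter a production resampler would apply):

```python
import numpy as np

def to_pcm16_mono_16k(frames: np.ndarray, src_rate: int = 48000,
                      dst_rate: int = 16000) -> bytes:
    """Convert float frames of shape (n, channels) in [-1, 1] at
    src_rate into 16-bit mono PCM bytes at dst_rate by averaging the
    channels and keeping every (src_rate // dst_rate)-th sample."""
    mono = frames.mean(axis=1)        # downmix to mono
    step = src_rate // dst_rate       # 48000 -> 16000: keep every 3rd sample
    mono = mono[::step]
    pcm = np.clip(mono, -1.0, 1.0)    # guard against clipping
    return (pcm * 32767).astype(np.int16).tobytes()

chunk = np.zeros((1024, 2), dtype=np.float32)  # like mic_recorder.record(numframes=1024)
print(len(to_pcm16_mono_16k(chunk)))           # 342 samples -> 684 bytes
```

The alternative is to skip the conversion and instead declare the real capture format up front by passing an AudioStreamFormat to PullAudioInputStream, so the service resamples on its side.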
Upvotes: 1