Reputation: 41
Imagine you are calling a friend through your computer: you speak into a microphone, and you hear your friend through your speakers.
I want to transcribe the conversation in real time. Since I don't know which platform the call uses, I can't rely on anything platform-specific. For this reason, the best way to approach the problem seems to be transcribing both the microphone input and the audio coming out of the speakers.
Unfortunately, I cannot figure out how to do both at once.
I want real-time (or close to real-time; a 5-second delay is okay) transcription of a call between someone using a laptop and a person on the other side. I am using the Azure AI Speech service to transcribe ongoing audio streams; it can also transcribe from a file.
For the person on our side, I can use the microphone input audio; for the person on the other side, the speaker output audio. The only problem is that I don't know how to combine the two.
The code below works perfectly for transcribing from the microphone in real time. I just don't know how to add in the other half of the conversation from the speaker audio.
def transcribe_from_microphone():
I found this:
import soundcard as sc
Upvotes: 1
Views: 1098
Reputation: 41
This is a "sorta" answer, still not perfect.
Turns out, you can specify an input device for the audio config and use that for the transcriber: you need the "audio device endpoint ID string [...] from the IMMDevice object".
This took a while to find, but I came across code (credit to @Damien) that finds exactly that string:
Unfortunately, I don't seem to be able to use my headset speaker with Azure (I just don't get any transcriptions), but I am able to use the Stereo Mix, which after a lot of fiddling does let me transcribe the speaker output. The transcription is also very slow and inaccurate.
I am attaching relevant code for reference:
Note: this is an altered version of my code, so it is possible I accidentally removed something that is in fact necessary, but the main idea is there. Feel free to ask questions if something doesn't work.
I didn't have to use the soundcard library or create any custom audio stream classes (which I found very painful). I'm going to open another question about the slowness, though I suspect it is due to the threads rather than the Azure service itself.
Upvotes: 1
Reputation: 3332
First, capture the microphone input and the speaker output simultaneously, and stream both audio inputs to the Azure Speech service for transcription.
Then combine the transcriptions from the two sources into a unified output.
Use the soundcard library to capture audio from the microphone and the speaker, and use Azure's Speech SDK to create two separate recognizers: one for the microphone input and one for the speaker output.
App.py:
import threading
import soundcard as sc
import azure.cognitiveservices.speech as speechsdk
import queue
import warnings

# Suppress the SoundcardRuntimeWarning
warnings.filterwarnings("ignore", category=sc.SoundcardRuntimeWarning)

# Your Azure subscription key and region
audio_key = 'YOUR_AZURE_SUBSCRIPTION_KEY'
audio_region = 'YOUR_AZURE_REGION'

# Queues for passing audio data between threads
mic_queue = queue.Queue()
speaker_queue = queue.Queue()

# Function to capture microphone audio
def capture_mic_audio(mic_queue):
    mic = sc.get_microphone(id=str(sc.default_microphone().name))
    with mic.recorder(samplerate=48000) as mic_recorder:
        while True:
            data = mic_recorder.record(numframes=1024)
            mic_queue.put(data)

# Function to capture speaker (loopback) audio
def capture_speaker_audio(speaker_queue):
    speaker = sc.get_microphone(id=str(sc.default_speaker().name), include_loopback=True)
    with speaker.recorder(samplerate=48000) as speaker_recorder:
        while True:
            data = speaker_recorder.record(numframes=1024)
            speaker_queue.put(data)

# Function to create an audio input stream for the Azure Speech SDK
def create_audio_input_stream(audio_queue):
    class AudioInputStream(speechsdk.audio.PullAudioInputStreamCallback):
        def __init__(self):
            super().__init__()

        def read(self, buffer, size):
            try:
                data = audio_queue.get(block=False)
                buffer[:len(data)] = data
                return len(data)
            except queue.Empty:
                return 0

        def close(self):
            pass

    return speechsdk.audio.AudioConfig(stream=speechsdk.audio.PullAudioInputStream(AudioInputStream()))

# Function to start Azure speech recognition
def start_recognition(audio_config, speech_config, name):
    recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)

    def recognized(evt):
        if evt.result.reason == speechsdk.ResultReason.RecognizedSpeech:
            print(f"{name} recognized: {evt.result.text}")
        elif evt.result.reason == speechsdk.ResultReason.NoMatch:
            print(f"{name} recognized: No speech could be recognized")
        elif evt.result.reason == speechsdk.ResultReason.Canceled:
            cancellation_details = evt.result.cancellation_details
            print(f"{name} recognized: Canceled: {cancellation_details.reason}")
            if cancellation_details.reason == speechsdk.CancellationReason.Error:
                print(f"Error details: {cancellation_details.error_details}")

    recognizer.recognized.connect(recognized)
    recognizer.start_continuous_recognition_async()
    return recognizer

# Main function
def main():
    speech_config = speechsdk.SpeechConfig(subscription=audio_key, region=audio_region)
    speech_config.speech_recognition_language = "en-US"

    # Start capturing audio
    threading.Thread(target=capture_mic_audio, args=(mic_queue,), daemon=True).start()
    threading.Thread(target=capture_speaker_audio, args=(speaker_queue,), daemon=True).start()

    # Create audio input streams
    mic_audio_input = create_audio_input_stream(mic_queue)
    speaker_audio_input = create_audio_input_stream(speaker_queue)

    # Start speech recognizers
    mic_recognizer = start_recognition(mic_audio_input, speech_config, "Microphone")
    speaker_recognizer = start_recognition(speaker_audio_input, speech_config, "Speaker")

    # Keep the program running to process transcriptions
    try:
        while True:
            pass
    except KeyboardInterrupt:
        mic_recognizer.stop_continuous_recognition_async()
        speaker_recognizer.stop_continuous_recognition_async()

if __name__ == "__main__":
    main()
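One caveat about the read() callback in the App.py listing: in the Speech SDK's pull-stream model, a return value of 0 is documented to mean end of stream, so returning 0 whenever the queue is momentarily empty risks ending recognition early. A standalone sketch of a blocking variant (no Azure imports; BlockingQueueReader and the None sentinel are illustrative names, not SDK API):

```python
import queue

class BlockingQueueReader:
    """Pull-style read() that waits for data instead of returning 0
    when the queue is momentarily empty.  Returning 0 is reserved for
    the genuine end of the stream, signalled here by a sentinel."""
    def __init__(self, audio_queue, sentinel=None):
        self.audio_queue = audio_queue
        self.sentinel = sentinel  # producer puts this to end the stream

    def read(self, buffer: memoryview) -> int:
        data = self.audio_queue.get(block=True)   # wait for audio
        if data is self.sentinel:
            return 0                              # genuine end of stream
        n = min(len(buffer), len(data))
        buffer[:n] = data[:n]
        return n                                  # bytes actually written

q = queue.Queue()
q.put(b"\x01\x02\x03\x04")
q.put(None)                   # sentinel: end of stream
reader = BlockingQueueReader(q)
buf = bytearray(8)
print(reader.read(memoryview(buf)))   # 4
print(reader.read(memoryview(buf)))   # 0
```

On shutdown, the producer thread puts the sentinel into the queue so read() returns 0 exactly once, at the real end of the stream, rather than every time capture briefly falls behind.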
As you can see above, I have used two separate threads to capture audio from the microphone and the speaker; each thread captures audio continuously and puts it into a queue. I also created a custom PullAudioInputStreamCallback to feed audio data from the queues to the Azure Speech SDK.
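The two recognizers print each side's text independently; to get the unified transcription mentioned at the start, one simple option (a sketch using hypothetical (timestamp, text) events, not part of the original answer) is to record the arrival time of each recognized event and sort the labeled events chronologically:

```python
def merge_transcripts(mic_events, speaker_events):
    """Merge two lists of (timestamp, text) events into one
    chronologically ordered transcript with speaker labels."""
    labeled = (
        [(t, "Microphone", txt) for t, txt in mic_events]
        + [(t, "Speaker", txt) for t, txt in speaker_events]
    )
    return [f"[{spk}] {txt}" for t, spk, txt in sorted(labeled)]

# Example with made-up timestamps (seconds since call start):
mic = [(0.5, "Hi, can you hear me?"), (6.2, "Great, let's start.")]
spk = [(2.8, "Yes, loud and clear.")]
for line in merge_transcripts(mic, spk):
    print(line)   # three lines, in chronological order
```

In the live code, the timestamps would come from time.monotonic() captured inside each recognized() callback before the event is appended to a shared list.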
Result:
Modified:
Converting the NumPy array to bytes, and returning the number of bytes actually written (len(data) counts frames, not bytes, so the slice lengths would not match otherwise):
def read(self, buffer: memoryview) -> int:
    try:
        data = audio_queue.get(block=False)
        audio_bytes = data.tobytes()  # convert the NumPy frames to raw bytes
        n = min(len(buffer), len(audio_bytes))
        buffer[:n] = audio_bytes[:n]
        return n
    except queue.Empty:
        return 0
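On the slowness and inaccuracy: soundcard's recorders deliver float frames (here 48 kHz, typically two channels), while an Azure pull stream defaults to 16 kHz, 16-bit, mono PCM unless you pass an explicit speechsdk.audio.AudioStreamFormat(samples_per_second=..., bits_per_sample=..., channels=...) when creating it. Feeding raw float frames in the default format is a plausible cause of the poor results. A sketch of converting a captured chunk before it is queued (to_pcm16_mono_16k is my own name, and the naive decimation skips the low-pass filter a production resampler would apply):

```python
import numpy as np

def to_pcm16_mono_16k(frames: np.ndarray, src_rate: int = 48000,
                      dst_rate: int = 16000) -> bytes:
    """Convert float frames of shape (n, channels) in [-1, 1] at
    src_rate into 16-bit mono PCM bytes at dst_rate by averaging the
    channels and keeping every (src_rate // dst_rate)-th sample."""
    mono = frames.mean(axis=1)        # downmix to mono
    step = src_rate // dst_rate       # 48000 -> 16000: keep every 3rd sample
    mono = mono[::step]
    pcm = np.clip(mono, -1.0, 1.0)    # guard against clipping
    return (pcm * 32767).astype(np.int16).tobytes()

chunk = np.zeros((1024, 2), dtype=np.float32)  # like mic_recorder.record(numframes=1024)
print(len(to_pcm16_mono_16k(chunk)))           # 342 samples -> 684 bytes
```

The alternative is to skip the conversion and instead declare the real capture format up front by passing an AudioStreamFormat to PullAudioInputStream, so the service resamples on its side.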
Upvotes: 1