Reading audio file and converting into text using Azure Speech services in python, but only the first sentence is converted into speech

Question

Below is the code,

import json
import os
from azure.storage.blob import BlobServiceClient, BlobClient, ContainerClient
import azure.cognitiveservices.speech as speechsdk

def main(filename):
    container_name="test-container"
            print(filename)
    blob_service_client = BlobServiceClient.from_connection_string("DefaultEndpoint")
    container_client=blob_service_client.get_container_client(container_name)
    blob_client = container_client.get_blob_client(filename)
    with open(filename, "wb") as f:
        data = blob_client.download_blob()
        data.readinto(f)

    speech_key, service_region = "1234567", "eastus"
    speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=service_region)

    audio_input = speechsdk.audio.AudioConfig(filename=filename)
    print("Audio Input:-",audio_input)
  
    speech_config.speech_recognition_language="en-US"
    speech_config.request_word_level_timestamps()
    speech_config.enable_dictation()
    speech_config.output_format = speechsdk.OutputFormat(1)

    speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_input)
    print("speech_recognizer:-",speech_recognizer)
    #result = speech_recognizer.recognize_once()
    all_results = []

    def handle_final_result(evt):
        all_results.append(evt.result.text)  
    done = False 

    def stop_cb(evt):
        #print('CLOSING on {}'.format(evt))
        speech_recognizer.stop_continuous_recognition()
        global done
        done= True

    #Appends the recognized text to the all_results variable. 
    speech_recognizer.recognized.connect(handle_final_result) 
    speech_recognizer.session_stopped.connect(stop_cb)
    speech_recognizer.canceled.connect(stop_cb)

    speech_recognizer.start_continuous_recognition()
    
    
    #while not done:
        #time.sleep(.5)
    
    print("Printing all results from speech to text:")
    print(all_results)


    
main(filename="test.wav")

Error while calling from main function,

test.wav
Audio Input:- 
speech_recognizer:- 
[]

Expected Output (Output without using main function)

test.wav
Audio Input:- 
speech_recognizer:- 
Printing all results from speech to text:
['hi', '', '', 'Uh.', 'A good laugh.', '1487', "OK, OK, I think that's enough.", '']

The existing code works perfectly if we do not use the main function, but when i call this using main function i do not get the desired output. Please guide us in the missing part.

Satya V · Accepted Answer

As described in the article here,recognize_once_async() (the method that you re using) - this method will only detect a recognized utterance from the input starting at the beginning of detected speech until the next pause.

From my understanding, your requirement would be to met if you make use of the start_continuous_recognition().The start function will start and continue processing all utterances until you invoke the stop function.

This method has lot of events connected to it, the "recognized" event fires when speech recognition process occurs. You need to have an event handler in place to handle recognition and extract the text. You could refer to the article here for more information.

Sharing a sample snippet that makes use of the start_continuous_recognition() to convert audio to text.

import azure.cognitiveservices.speech as speechsdk
import time
import datetime

# Creates an instance of a speech config with specified subscription key and service region.
# Replace with your own subscription key and region identifier from here: https://aka.ms/speech/sdkregion
speech_key, service_region = "YOURSUBSCRIPTIONKEY", "YOURREGION"
speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=service_region)

# Creates an audio configuration that points to an audio file.
# Replace with your own audio filename.
audio_filename = "sample.wav"
audio_input = speechsdk.audio.AudioConfig(filename=audio_filename)

# Creates a recognizer with the given settings
speech_config.speech_recognition_language="en-US"
speech_config.request_word_level_timestamps()
speech_config.enable_dictation()
speech_config.output_format = speechsdk.OutputFormat(1)

speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_input)

#result = speech_recognizer.recognize_once()
all_results = []



#https://learn.microsoft.com/en-us/python/api/azure-cognitiveservices-speech/azure.cognitiveservices.speech.recognitionresult?view=azure-python
def handle_final_result(evt):
    all_results.append(evt.result.text) 
    
    
done = False

def stop_cb(evt):
    print('CLOSING on {}'.format(evt))
    speech_recognizer.stop_continuous_recognition()
    global done
    done= True

#Appends the recognized text to the all_results variable. 
speech_recognizer.recognized.connect(handle_final_result) 

#Connect callbacks to the events fired by the speech recognizer & displays the info/status
#Ref:https://learn.microsoft.com/en-us/python/api/azure-cognitiveservices-speech/azure.cognitiveservices.speech.eventsignal?view=azure-python   
speech_recognizer.recognizing.connect(lambda evt: print('RECOGNIZING: {}'.format(evt)))
speech_recognizer.recognized.connect(lambda evt: print('RECOGNIZED: {}'.format(evt)))
speech_recognizer.session_started.connect(lambda evt: print('SESSION STARTED: {}'.format(evt)))
speech_recognizer.session_stopped.connect(lambda evt: print('SESSION STOPPED {}'.format(evt)))
speech_recognizer.canceled.connect(lambda evt: print('CANCELED {}'.format(evt)))
# stop continuous recognition on either session stopped or canceled events
speech_recognizer.session_stopped.connect(stop_cb)
speech_recognizer.canceled.connect(stop_cb)

speech_recognizer.start_continuous_recognition()

while not done:
    time.sleep(.5)
    
print("Printing all results:")
print(all_results)

Sample output :

Calling the same through a function

Encapsulated in a function and tried calling it.

Just tweaked some more and encapsulated in a function. made sure the variable "done" is accessed non locally. Pls check and let me know

import azure.cognitiveservices.speech as speechsdk
import time
import datetime

def speech_to_text():
    
    # Creates an instance of a speech config with specified subscription key and service region.
    # Replace with your own subscription key and region identifier from here: https://aka.ms/speech/sdkregion
    speech_key, service_region = "<>", "<>"
    speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=service_region)

    # Creates an audio configuration that points to an audio file.
    # Replace with your own audio filename.
    audio_filename = "whatstheweatherlike.wav"
    audio_input = speechsdk.audio.AudioConfig(filename=audio_filename)

    # Creates a recognizer with the given settings
    speech_config.speech_recognition_language="en-US"
    speech_config.request_word_level_timestamps()
    speech_config.enable_dictation()
    speech_config.output_format = speechsdk.OutputFormat(1)

    speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_input)

    #result = speech_recognizer.recognize_once()
    all_results = []



    #https://learn.microsoft.com/en-us/python/api/azure-cognitiveservices-speech/azure.cognitiveservices.speech.recognitionresult?view=azure-python
    def handle_final_result(evt):
        all_results.append(evt.result.text) 
    
    
    done = False

    def stop_cb(evt):
        print('CLOSING on {}'.format(evt))
        speech_recognizer.stop_continuous_recognition()
        nonlocal done
        done= True

    #Appends the recognized text to the all_results variable. 
    speech_recognizer.recognized.connect(handle_final_result) 

    #Connect callbacks to the events fired by the speech recognizer & displays the info/status
    #Ref:https://learn.microsoft.com/en-us/python/api/azure-cognitiveservices-speech/azure.cognitiveservices.speech.eventsignal?view=azure-python   
    speech_recognizer.recognizing.connect(lambda evt: print('RECOGNIZING: {}'.format(evt)))
    speech_recognizer.recognized.connect(lambda evt: print('RECOGNIZED: {}'.format(evt)))
    speech_recognizer.session_started.connect(lambda evt: print('SESSION STARTED: {}'.format(evt)))
    speech_recognizer.session_stopped.connect(lambda evt: print('SESSION STOPPED {}'.format(evt)))
    speech_recognizer.canceled.connect(lambda evt: print('CANCELED {}'.format(evt)))
    # stop continuous recognition on either session stopped or canceled events
    speech_recognizer.session_stopped.connect(stop_cb)
    speech_recognizer.canceled.connect(stop_cb)

    speech_recognizer.start_continuous_recognition()

    while not done:
        time.sleep(.5)
            
    print("Printing all results:")
    print(all_results)

#calling the conversion through a function    
speech_to_text()

Reading audio file and converting into text using Azure Speech services in python, but only the first sentence is converted into speech

Answers (1)

Related Questions