Reputation: 2230
I don't understand how to read the byte stream from the Azure TTS service in Python and reliably play it.
The AudioDataStream class has bool = can_read_data(requested_bytes: int, pos: int) and int = read_data(audio_buffer: bytes, pos: int | None = None), so I tried this:
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription='key', region='uksouth')
speech_config.set_speech_synthesis_output_format(speechsdk.SpeechSynthesisOutputFormat.Riff16Khz16BitMonoPcm)
speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=None)
text = "Hello, world!"

# Synthesize the speech
result = speech_synthesizer.speak_text_async(text).get()

# Create an AudioDataStream from the synthesized result
if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
    print("Speech synthesized for text [{}]".format(text))
    audio_data_stream = speechsdk.AudioDataStream(result)
    audio_data_stream.save_to_wav_file("output.wav")

    # Reset the stream position to the beginning since saving to file puts the position to end.
    audio_data_stream.position = 0

    # Reads data from the stream
    audio_buffer = bytes(16000)
    total_size = 0
    filled_size = audio_data_stream.read_data(audio_buffer)
    while filled_size > 0:
        print("{} bytes received.".format(filled_size))
        total_size += filled_size
        filled_size = audio_data_stream.read_data(audio_buffer)
    print("Totally {} bytes received for text [{}].".format(total_size, text))

    # Initialize playing
    from pydub import AudioSegment
    import io
    audio_segment = AudioSegment(
        data=audio_buffer,  # The raw audio data you received
        sample_width=2,  # Bytes per sample
        frame_rate=16000,  # Sampling frequency
        channels=1  # Mono
    )
    from pydub.playback import play
    play(audio_segment)
elif result.reason == speechsdk.ResultReason.Canceled:
    cancellation_details = result.cancellation_details
    print("Speech synthesis canceled: {}".format(cancellation_details.reason))
    if cancellation_details.reason == speechsdk.CancellationReason.Error:
        print("Error details: {}".format(cancellation_details.error_details))
It's streaming and saving, but the played-back stream doesn't sound right. What am I getting wrong?
Upvotes: 0
Views: 363
Reputation: 3649
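I changed your code to build and play an AudioSegment inside the read loop, slicing the buffer to filled_size so that only the bytes read_data actually wrote are played. With that change I could hear the output stream, and the audio was also saved to the output.wav file. The relevant lines:
from pydub.playback import play
audio_segment = AudioSegment(data=audio_buffer[:filled_size], sample_width=2, frame_rate=16000, channels=1)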
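As a rough illustration of why the slice matters, here is a minimal sketch that uses the standard library's io.BytesIO.readinto as a stand-in for read_data. That substitution is an assumption made purely for illustration; it is not the Speech SDK. Both calls fill a caller-supplied buffer and return the number of bytes written, so a reused buffer may still hold bytes from the previous read:

```python
import io

# Stand-in for the audio stream: 10 bytes of fake "audio" (not the Speech SDK).
source = io.BytesIO(b"AAAAAAAABB")

# Reusable fixed-size buffer, playing the role of audio_buffer in the question.
buffer = bytearray(8)

chunks_sliced = []
chunks_unsliced = []

filled = source.readinto(buffer)  # returns how many bytes were written
while filled > 0:
    chunks_sliced.append(bytes(buffer[:filled]))  # only the freshly written bytes
    chunks_unsliced.append(bytes(buffer))         # fresh bytes plus a stale tail
    filled = source.readinto(buffer)

print(chunks_sliced)
print(chunks_unsliced)
```

In this sketch the second read writes only 2 bytes, so the unsliced copy of the buffer still carries 6 bytes left over from the first read. Joining the sliced chunks gives back the original data.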
Complete code:
import azure.cognitiveservices.speech as speechsdk
from pydub import AudioSegment
from pydub.playback import play
import io

speech_config = speechsdk.SpeechConfig(subscription='<speech_key>', region='<speech_region>')
speech_config.set_speech_synthesis_output_format(speechsdk.SpeechSynthesisOutputFormat.Riff16Khz16BitMonoPcm)
speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=None)
text = "Hello, world!"

result = speech_synthesizer.speak_text_async(text).get()

if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
    print("Speech synthesized for text [{}]".format(text))
    audio_data_stream = speechsdk.AudioDataStream(result)
    audio_data_stream.save_to_wav_file("output.wav")

    # Rewind, because saving to a file leaves the position at the end.
    audio_data_stream.position = 0

    audio_buffer = bytes(16000)
    total_size = 0
    filled_size = audio_data_stream.read_data(audio_buffer)
    while filled_size > 0:
        print("{} bytes received.".format(filled_size))
        total_size += filled_size
        # Play only the bytes that this read actually filled.
        audio_segment = AudioSegment(data=audio_buffer[:filled_size], sample_width=2, frame_rate=16000, channels=1)
        play(audio_segment)
        filled_size = audio_data_stream.read_data(audio_buffer)
    print("Totally {} bytes received for text [{}].".format(total_size, text))
elif result.reason == speechsdk.ResultReason.Canceled:
    cancellation_details = result.cancellation_details
    print("Speech synthesis canceled: {}".format(cancellation_details.reason))
    if cancellation_details.reason == speechsdk.CancellationReason.Error:
        print("Error details: {}".format(cancellation_details.error_details))
Output:
The code above ran successfully, and I was able to hear the audio stream.
C:\Users\xxxxxxxx\Documents\xxxxxxxx>python app.py
Speech synthesized for text [Hello, world!]
16000 bytes received.
16000 bytes received.
12000 bytes received.
Totally 44000 bytes received for text [Hello, world!].
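The chunk sizes in this output can be related to playback time. A small sketch, assuming the bytes are raw 16 kHz, 16-bit mono PCM as the requested output format suggests (the helper name pcm_duration_seconds is made up for this example):

```python
def pcm_duration_seconds(num_bytes, sample_rate=16000, sample_width=2, channels=1):
    """Return the playback time of raw PCM data in seconds.

    Assumes headerless PCM; the defaults mirror the format used in the thread.
    """
    bytes_per_second = sample_rate * sample_width * channels
    return num_bytes / bytes_per_second


# Chunk sizes and total taken from the console output above.
for size in (16000, 12000, 44000):
    print(size, "bytes ->", pcm_duration_seconds(size), "s")
```

Under that assumption each 16000-byte read is half a second of audio, so calling play once per chunk plays the sentence as a few short separate segments rather than one continuous clip.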
Upvotes: 0