Richard

Reputation: 389

Azure Speech Services + Text to Speech + Silent SpeakSsmlAsync

I am using Azure Text to Speech, part of the Cognitive Services.

I compose my request as SSML, and then call the function SpeakSsmlAsync.

If I choose the output format Audio24Khz160KBitRateMonoMp3, the function returns almost immediately with the speech data. But if I choose the output format Riff24Khz16BitMonoPcm, the function plays the speech through my speakers before returning with the speech data.

Is there a way to call Riff24Khz16BitMonoPcm silently, so that the speech data is returned but without hearing it first?

+++

Update 3rd August 2024, here is the code:

// Configure the Speech service: key, region, output format, voice, and log file
SpeechConfig speechConfig = SpeechConfig.FromSubscription(SubscriptionKey, SubscriptionRegion);
speechConfig.OutputFormat = OutputFormat.Detailed;
speechConfig.SetSpeechSynthesisOutputFormat(SpeechSynthesisOutputFormat.Riff24Khz16BitMonoPcm);
speechConfig.SpeechSynthesisVoiceName = "de-DE-KatjaNeural";
speechConfig.SetProperty(PropertyId.Speech_LogFilename, LogServices.SpeechLogFilepath);

// No AudioConfig is passed here, so the synthesizer uses the default speaker output
using (SpeechSynthesizer speechSynthesizer = new SpeechSynthesizer(speechConfig))
{

    // Build the SSML request from the input text
    string strSsml = _textToSSML(Language, speechConfig.SpeechSynthesisVoiceName, strText);

    // Synthesize the SSML and wait for the result
    SpeechSynthesisResult speechSynthesisResult = await speechSynthesizer.SpeakSsmlAsync(strSsml);

    // Check whether synthesis completed successfully
    if (speechSynthesisResult.Reason == ResultReason.SynthesizingAudioCompleted)
    {

        // Process the wav and save as mp3
        WaveFile waveFile = new WaveFile(speechSynthesisResult.AudioData);
        _processAndSave(waveFile, audioFilepaths);

    }
    else
    {

        // Log the failed synthesis together with the input text
        new LogServices().AddTextToSpeechError(speechSynthesisResult, strText);

    }

}

Upvotes: 0

Views: 354

Answers (1)

Dasari Kamali

Reputation: 3649

The code below worked for me. It passes an AudioConfig created from a pull stream to the SpeechSynthesizer, so the audio goes to that stream instead of the speakers. It then uses AudioDataStream to save the Riff24Khz16BitMonoPcm speech to a .wav file and converts that file to .mp3.

  1. I used NAudio.Wave package to read data from a .wav file.
  2. I used NAudio.Lame to write the data from .wav to a .mp3 file.

Code:

using System;
using System.IO;
using System.Threading.Tasks;
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;
using NAudio.Wave;
using NAudio.Lame;

public class TextToSpeechService
{
    private string SubscriptionKey = "<speech_key>";
    private string SubscriptionRegion = "<speech_region>";

    public async Task GenerateSpeechAsync(string text, string outputFilePathWav, string outputFilePathMp3)
    {
        var speechConfig = SpeechConfig.FromSubscription(SubscriptionKey, SubscriptionRegion);
        speechConfig.SpeechSynthesisVoiceName = "de-DE-KatjaNeural";
        speechConfig.SetSpeechSynthesisOutputFormat(SpeechSynthesisOutputFormat.Riff24Khz16BitMonoPcm);

        // Route the synthesized audio to a stream instead of the default speaker
        using var stream = AudioOutputStream.CreatePullStream();
        using var audioConfig = AudioConfig.FromStreamOutput(stream);

        using var synthesizer = new SpeechSynthesizer(speechConfig, audioConfig);
        string ssml = _textToSSML("de-DE", speechConfig.SpeechSynthesisVoiceName, text);
        var result = await synthesizer.SpeakSsmlAsync(ssml);

        if (result.Reason == ResultReason.SynthesizingAudioCompleted)
        {
            var directory = Path.GetDirectoryName(outputFilePathWav);
            if (!string.IsNullOrEmpty(directory) && !Directory.Exists(directory))
            {
                Directory.CreateDirectory(directory);
            }

            using var audioDataStream = AudioDataStream.FromResult(result);
            await audioDataStream.SaveToWaveFileAsync(outputFilePathWav);

            ConvertWavToMp3(outputFilePathWav, outputFilePathMp3);
        }
        else if (result.Reason == ResultReason.Canceled)
        {
            var cancellation = SpeechSynthesisCancellationDetails.FromResult(result);
            Console.WriteLine($"Error synthesizing audio: {cancellation.Reason}");
            Console.WriteLine($"Error details: {cancellation.ErrorDetails}");
        }
    }
    
    private string _textToSSML(string language, string voice, string text)
    {
        return $"<speak version='1.0' xml:lang='{language}'><voice name='{voice}'>{text}</voice></speak>";
    }

    private void ConvertWavToMp3(string wavFilePath, string mp3FilePath)
    {
        using var reader = new WaveFileReader(wavFilePath);
        using var writer = new LameMP3FileWriter(mp3FilePath, reader.WaveFormat, LAMEPreset.STANDARD);
        reader.CopyTo(writer);
    }
}

class Program
{
    static async Task Main(string[] args)
    {
        var ttsService = new TextToSpeechService();
        string text = "Hallo, wie geht es Ihnen?";
        string outputFilePathWav = @"C:\Users\kamali\source\repos\ConsoleApp1\output.wav";
        string outputFilePathMp3 = @"C:\Users\kamali\source\repos\ConsoleApp1\output.mp3";
        await ttsService.GenerateSpeechAsync(text, outputFilePathWav, outputFilePathMp3);
        Console.WriteLine("Speech synthesis completed. Audio saved to " + outputFilePathMp3);
    }
}
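
The _textToSSML helper above interpolates the text into the SSML as-is, so characters such as & or < in the input would produce invalid XML. A minimal sketch of an escaping variant follows. The class name SsmlBuilder and method name Build are hypothetical, and the xmlns attribute is added on the assumption that the service expects the standard SSML namespace on the speak element. It also assumes the input is plain text, not text that already contains SSML tags.

```csharp
using System;
using System.Security;

public static class SsmlBuilder
{
    // Hypothetical helper: wraps plain text in a <speak>/<voice> envelope.
    public static string Build(string language, string voice, string text)
    {
        // SecurityElement.Escape replaces <, >, &, " and ' with XML entities.
        string safeText = SecurityElement.Escape(text);

        return "<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' " +
               $"xml:lang='{language}'><voice name='{voice}'>{safeText}</voice></speak>";
    }
}
```

For example, SsmlBuilder.Build("de-DE", "de-DE-KatjaNeural", "Tom & Jerry") yields SSML in which the text appears as "Tom &amp; Jerry".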

Output:

The TextToSpeechService code above ran successfully and saved the audio to the .mp3 file at the path given in outputFilePathMp3.
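
If the intermediate .wav file is not needed, a shorter route that stays closer to the code in the question is to pass null as the AudioConfig and read the bytes from the result. This is a sketch under assumptions, not a tested drop-in: the class name SilentSynthesis and method name SynthesizeAsync are hypothetical, and it relies on the SDK behaviour that an explicit null AudioConfig disables playback on the default output device.

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;

public static class SilentSynthesis
{
    // Hypothetical helper: returns the synthesized WAV bytes without playback.
    public static async Task<byte[]> SynthesizeAsync(SpeechConfig speechConfig, string ssml)
    {
        speechConfig.SetSpeechSynthesisOutputFormat(
            SpeechSynthesisOutputFormat.Riff24Khz16BitMonoPcm);

        // Passing null explicitly (rather than omitting the argument) means
        // the audio is not sent to the default speaker.
        using var synthesizer = new SpeechSynthesizer(speechConfig, null as AudioConfig);
        using var result = await synthesizer.SpeakSsmlAsync(ssml);

        if (result.Reason == ResultReason.Canceled)
        {
            var details = SpeechSynthesisCancellationDetails.FromResult(result);
            throw new InvalidOperationException(
                $"Synthesis canceled: {details.Reason}: {details.ErrorDetails}");
        }
        if (result.Reason != ResultReason.SynthesizingAudioCompleted)
        {
            throw new InvalidOperationException($"Unexpected result: {result.Reason}");
        }

        // AudioData holds the complete audio in the requested output format.
        return result.AudioData;
    }
}
```

The returned byte array can then be handed to whatever processing step already consumes speechSynthesisResult.AudioData, such as the WaveFile constructor in the question.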

Upvotes: 0
