Simon Nazarenko

Reputation: 157

Azure cognitive services text-to-speech service "whispering" style adjustments

I am working on a project that requires voice-over for videos. I was looking for a free or cheap option for more natural-sounding voice synthesis and ran into an article suggesting the Azure TTS service. As of 1/23/2024, the Azure Cognitive Services text-to-speech service is still free for up to 0.5 million characters. It works well for what I'm doing.

I registered with Azure and created a TTS service. I chose en-US-NancyNeural as my primary voice because her "whispering" style sounded better than the others.

I would like to make the whispering voice softer than it is by default. I figured SSML is the correct approach for altering the TTS result. Can anyone share their experience with options that make the whispering slower, softer, and quieter (more natural)? Although Nancy's default whispering is better than the others, "she" still whispers very quickly and loudly, lol.

What works well? What does not work? Please share your experience.

Here is a sample of my Node.js TTS function:

async function generateSpeechFromText(name, text, tempDirectory) {
  console.log(`Generating speech from text for section: ${name}`)
  const audioFile = `${tempDirectory}/${name}.wav`

  const speechConfig = TTSSdk.SpeechConfig.fromSubscription(
    process.env.AZURE_TTS_KEY,
    process.env.AZURE_TTS_REGION
  )
  const audioConfig = TTSSdk.AudioConfig.fromAudioFileOutput(audioFile)
  speechConfig.speechSynthesisVoiceName = "en-US-NancyNeural"

  let synthesizer = new TTSSdk.SpeechSynthesizer(speechConfig, audioConfig)
  const ssml = `<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
                  <voice name="${speechConfig.speechSynthesisVoiceName}">
                    <mstts:express-as style="whispering">
                      ${text}
                    </mstts:express-as>
                  </voice>
                </speak>`

  return new Promise((resolve, reject) => {
    synthesizer.speakSsmlAsync(
      ssml,
      (result) => {
        if (result.reason === TTSSdk.ResultReason.SynthesizingAudioCompleted) {
          console.log("Synthesis finished for: " + name)
          resolve(audioFile)
        } else {
          console.error(
            "Speech synthesis failed for: " + name,
            result.errorDetails
          )
          reject(result.errorDetails)
        }
        synthesizer.close()
      },
      (err) => {
        console.error("Error during synthesis for: " + name, err)
        synthesizer.close()
        reject(err)
      }
    )
  })
}

And here is the link to the page that goes over SSML structure and events

Upvotes: 0

Views: 704

Answers (1)

Suresh Chikkam

Reputation: 3413

I would like to make the whispering voice softer than it comes by default. I figured using SSML is the correct approach for altering the TTS result.

SSML (Speech Synthesis Markup Language) provides the prosody element, whose rate, volume, and pitch attributes control the speed, loudness, and pitch of the synthesized voice.

  • <prosody rate="slow">This is a slow whispering voice.</prosody>
  • <prosody volume="soft">This is a soft whispering voice.</prosody>
  • <prosody pitch="-50%">This is a lower-pitched whispering voice.</prosody>

Combine the attributes in a single prosody element inside the express-as element, like below:

const ssml = `<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
                  <voice name="${speechConfig.speechSynthesisVoiceName}">
                    <mstts:express-as style="whispering">
                      <prosody rate="slow" volume="soft" pitch="-50%">
                        ${text}
                      </prosody>
                    </mstts:express-as>
                  </voice>
                </speak>`;
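
Because the text is interpolated straight into the SSML, characters such as & or < in the input could make the document invalid XML. As a minimal sketch (the helper names escapeXml and buildWhisperSsml are hypothetical and not part of the Speech SDK, and the default values are just the ones suggested above), the SSML could be built with escaping and adjustable prosody values so different settings are easy to try:

```javascript
// Escape characters that are special in XML so arbitrary text is safe inside SSML.
function escapeXml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Hypothetical helper (not an SDK function): wraps the text in the
// "whispering" style plus a prosody element with the given attributes.
function buildWhisperSsml(text, options = {}) {
  const {
    voice = "en-US-NancyNeural",
    rate = "slow",
    volume = "soft",
    pitch = "-50%",
  } = options;
  return [
    `<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">`,
    `  <voice name="${escapeXml(voice)}">`,
    `    <mstts:express-as style="whispering">`,
    `      <prosody rate="${escapeXml(rate)}" volume="${escapeXml(volume)}" pitch="${escapeXml(pitch)}">`,
    `        ${escapeXml(text)}`,
    `      </prosody>`,
    `    </mstts:express-as>`,
    `  </voice>`,
    `</speak>`,
  ].join("\n");
}

// Example: override rate and volume, keep the default pitch.
const ssml = buildWhisperSsml("Tom & Jerry <whisper>", {
  rate: "-20%",
  volume: "x-soft",
});
console.log(ssml);
```

The resulting string can be passed to speakSsmlAsync in place of the inline template. Which rate, volume, and pitch values sound most natural for a given voice is something to judge by ear.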

Below is the full code, including an example usage:

const TTSSdk = require("microsoft-cognitiveservices-speech-sdk");

async function generateSpeechFromText(name, text, tempDirectory) {
  console.log(`Generating speech from text for section: ${name}`);
  const audioFile = `${tempDirectory}/${name}.wav`;

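  // "tts-key" and "region" are placeholders; replace them with your own
  // Speech resource key and region (or read them from environment variables).
  const speechConfig = TTSSdk.SpeechConfig.fromSubscription(
    "tts-key",
    "region"
  );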
  const audioConfig = TTSSdk.AudioConfig.fromAudioFileOutput(audioFile);
  speechConfig.speechSynthesisVoiceName = "en-US-NancyNeural";

  let synthesizer = new TTSSdk.SpeechSynthesizer(speechConfig, audioConfig);
  const ssml = `<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
                  <voice name="${speechConfig.speechSynthesisVoiceName}">
                    <mstts:express-as style="whispering">
                      <prosody rate="slow" volume="soft" pitch="-50%">
                        ${text}
                      </prosody>
                    </mstts:express-as>
                  </voice>
                </speak>`;

  return new Promise((resolve, reject) => {
    synthesizer.speakSsmlAsync(
      ssml,
      (result) => {
        if (result.reason === TTSSdk.ResultReason.SynthesizingAudioCompleted) {
          console.log("Synthesis finished for: " + name);
          resolve(audioFile);
        } else {
          console.error(
            "Speech synthesis failed for: " + name,
            result.errorDetails
          );
          reject(result.errorDetails);
        }
        synthesizer.close();
      },
      (err) => {
        console.error("Error during synthesis for: " + name, err);
        synthesizer.close();
        reject(err);
      }
    );
  });
}

// Example usage
const tempDirectory = "./output";
const sectionName = "example-modified";
const textToSynthesize = "Hello, this is a test whispering voice.";

generateSpeechFromText(sectionName, textToSynthesize, tempDirectory)
  .then((audioFile) => {
    console.log(`Audio file generated: ${audioFile}`);
  })
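  .catch((error) => {
    // Only the caught `error` is in scope here. It may be a plain string
    // (errorDetails), so check for a stack before logging it.
    console.error("Error generating speech:", error);
    if (error && error.stack) {
      console.error("Error stack trace:", error.stack);
    }
  });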

Output: (screenshot not reproduced)

Generated audio files: (screenshot not reproduced)
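
Since the best settings have to be judged by ear, it may help to generate several variants and compare them. The following is only a sketch: prosodyVariants is a hypothetical helper that lists combinations of the named rate and volume values, and it assumes you would feed each opening tag into a version of generateSpeechFromText modified to accept it (that modification is not shown).

```javascript
// Hypothetical helper: build every rate/volume combination to audition.
// Named values (e.g. "slow", "x-soft") are used so they are safe in file names.
function prosodyVariants(rates, volumes) {
  const variants = [];
  for (const rate of rates) {
    for (const volume of volumes) {
      variants.push({
        // Suggested section/file name for this variant.
        name: `whisper_rate-${rate}_vol-${volume}`,
        // Opening prosody tag to place inside the express-as element.
        prosodyOpenTag: `<prosody rate="${rate}" volume="${volume}">`,
      });
    }
  }
  return variants;
}

const variants = prosodyVariants(["slow", "x-slow"], ["soft", "x-soft"]);
for (const variant of variants) {
  console.log(variant.name, variant.prosodyOpenTag);
}
```

Each variant name can serve as the section name so the resulting .wav files are easy to tell apart when listening.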

Upvotes: 0
