Janet Gilbert

Reputation: 21

Azure TTS audio is distorted

In JavaScript, I'm attempting to stream audio generated in the Audio16Khz32KBitRateMonoMp3 format by the "microsoft-cognitiveservices-speech-sdk" SpeechSynthesizer, via Express, to a React frontend app. The first couple of sentences sound fine, but after that the speech is very distorted.

Here is the code that sends the audio:

    synthesizer.synthesizing = function (s, e) {
      currentAudioChunk = {
        audio: Buffer.from(e.result.audioData),
        offset: e.result.audioDuration / 10000, // Convert to milliseconds
      };

      sendEvent("audioData", {
        audio: currentAudioChunk.audio.toString("base64"),

        // This audioOffset data is null, and I'm sending it as a placeholder for now
        audioOffset: "0",
      });

      currentAudioChunk = null;
    };

When the audio is good the string sent looks like this:

"//NIxCElU/5IAY+IAYJ/2sMoSxHk0RH9BVzA0GTFmDQIAQ4Xv8iBn5oEqAUwtPDlBnyDh9f+ukXFoLdRAhSgoYRmQwQAIIJ0DVH/+eez/FwFkR+DcgYghGI/ImQQ0kU/////LRuTiDFw8fLRmk9Bv////v/+hTTN0Cuzk4TZ8qGCFf6UeMhhgB8BjVMd/t5V//NIxA4g2tqkAc9YAD7MqNv/AVisORkvL3sJA7Fje5e+5N7r/ZNxXO9991vl98n9A+bqGiY3myJPJAOEDjHGJfRMWtAsfGoO9ejd0GhoOy32zfEvfHvZL4vfvTvfV5xO2TeymMriv6//j+v5/j2MNzc+w4oEElx7v6u/2t/UcW0Tg+oDGEGSAs30kiaCvEsa//NIxA0f6tbFtGsQsB/wSwiDZ+tZaEobTf5rXCqf/XqycoAzWXtvSKlDpriVZX/mmfM+YFhrAo1ookTMDgiGJRKHii37Q6Eh+fFkyMUXF4c2/fTr+0rpvsyXQevBDueX8PPz/PPX88VXXxySFgZQJQqJRW71f/Ss+4XRpRipAO1XGSKjcylABX/VcJmGEBBG"

but when the distortion starts there are lots of repeated letters that appear to be junk, like this:

"//NIxHwAAANIAAAAAFVVVVVVVVVMQU1FMy4xMDBVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVV//NIxHwAAANIAAAAAFVVVVVVVVVMQU1FMy4xMDBVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVV//NIxHwAAANIAAAAAFVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVV"

The repeated characters are in the raw data received from Azure; they are not an artefact of the conversion to a string.

How can I get clean audio from Azure TTS?

I tried stripping out the Vs but that just corrupted the data entirely.
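
For what it's worth, the repeated V characters look like filler inside otherwise well-formed MP3 frames rather than junk: base64 `VVVV` decodes to three 0x55 bytes, and `//NI` decodes to an MPEG-2 Layer III frame header for 32 kbps at 16 kHz. Since each frame has a fixed length derived from that header, stripping characters would shift every following frame, which may be why that attempt corrupted the data. A minimal sketch for inspecting a chunk in Node.js follows; `inspectMp3Chunk` is a hypothetical helper, not part of the Speech SDK, and its lookup tables cover only MPEG-2 Layer III.

```javascript
// Sketch: decode a base64 chunk and read the first MPEG audio frame header.
// Assumption: only MPEG-2 Layer III is handled (what the "//NI" prefix decodes to).
const MPEG2_L3_BITRATES_KBPS = [0, 8, 16, 24, 32, 40, 48, 56, 64, 80, 96, 112, 128, 144, 160];
const MPEG2_SAMPLE_RATES_HZ = [22050, 24000, 16000];

function inspectMp3Chunk(base64Chunk) {
  const bytes = Buffer.from(base64Chunk, "base64");
  if (bytes.length < 4) return { isFrame: false };

  // Frame sync is 11 set bits: 0xFF followed by the top 3 bits of the next byte.
  const hasSync = bytes[0] === 0xff && (bytes[1] & 0xe0) === 0xe0;
  const versionBits = (bytes[1] >> 3) & 0x03; // 0b10 = MPEG-2
  const layerBits = (bytes[1] >> 1) & 0x03;   // 0b01 = Layer III
  if (!hasSync || versionBits !== 0b10 || layerBits !== 0b01) return { isFrame: false };

  const bitrateKbps = MPEG2_L3_BITRATES_KBPS[bytes[2] >> 4];
  const sampleRateHz = MPEG2_SAMPLE_RATES_HZ[(bytes[2] >> 2) & 0x03];
  const padding = (bytes[2] >> 1) & 0x01;
  // Index 0 (free format) and reserved indices are not handled by this sketch.
  if (!bitrateKbps || !sampleRateHz) return { isFrame: false };

  // MPEG-2 Layer III frame length in bytes.
  const frameBytes = Math.floor((72 * bitrateKbps * 1000) / sampleRateHz) + padding;
  return { isFrame: true, bitrateKbps, sampleRateHz, frameBytes };
}

console.log(inspectMp3Chunk("//NIxHwA"));
```

On the start of either sample string above, this reports 32 kbps, 16000 Hz and a frame length of 144 bytes, which matches the requested Audio16Khz32KBitRateMonoMp3 format.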

Upvotes: 0

Views: 65

Answers (1)

Naveen Sharma

Reputation: 1298

WAV is the default audio input format. If your input audio is compressed (in a format such as MP3 or OPUS), you need to decode the audio buffers and convert them to WAV.

According to this doc, the Speech SDK for JavaScript accepts WAV files with a sampling rate of 16 kHz or 8 kHz, 16-bit depth, and mono PCM.

Convert your decoded audio to WAV; you can then use the WAV data directly for audio streaming.
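
A buffer can be checked against those parameters by reading the format fields from its header. The sketch below assumes the canonical 44-byte RIFF/WAVE layout with the `fmt ` chunk first; `readWavFormat` and `isSupportedPcm` are hypothetical helpers, not Speech SDK functions.

```javascript
// Sketch: read the format fields from a canonical 44-byte RIFF/WAVE header.
// Assumption: the "fmt " chunk immediately follows the RIFF/WAVE preamble.
function readWavFormat(buffer) {
  if (buffer.length < 44) {
    throw new Error("Buffer too short for a WAV header");
  }
  const riff = buffer.toString("ascii", 0, 4);
  const wave = buffer.toString("ascii", 8, 12);
  const fmt = buffer.toString("ascii", 12, 16);
  if (riff !== "RIFF" || wave !== "WAVE" || fmt !== "fmt ") {
    throw new Error("Not a canonical RIFF/WAVE header");
  }
  return {
    audioFormat: buffer.readUInt16LE(20),   // 1 = uncompressed PCM
    channels: buffer.readUInt16LE(22),      // 1 = mono
    sampleRateHz: buffer.readUInt32LE(24),
    bitsPerSample: buffer.readUInt16LE(34),
  };
}

// True when the header describes 16-bit mono PCM at 16 kHz or 8 kHz.
function isSupportedPcm(format) {
  return (
    format.audioFormat === 1 &&
    format.channels === 1 &&
    format.bitsPerSample === 16 &&
    (format.sampleRateHz === 16000 || format.sampleRateHz === 8000)
  );
}
```

For example, `isSupportedPcm(readWavFormat(fs.readFileSync("OutputAudio.wav")))` would check a file on disk, given `const fs = require("fs")`.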

For text-to-speech, pass plain text. If your text is Base64-encoded, decode it to plain text first, using this doc or an online converter.

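    // Decode a Base64 string to UTF-8 text (Node.js).
    function b64_to_utf8(str) {
      return Buffer.from(str, "base64").toString("utf8");
    }

    const base64String = "SGVsbG8sIHdvcmxkIQ==";
    const decodedString = b64_to_utf8(base64String);

    console.log("Decoded String:", decodedString);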


Output:

    Decoded String: Hello, world!

Use the decoded string in Azure Text-to-Speech. The code below converts text to speech in WAV format and saves the audio to a file using Azure AI Speech.

Refer to this doc for information on using Azure AI services for text-to-speech in JavaScript.

(function() {
    "use strict";
    
    const sdk = require("microsoft-cognitiveservices-speech-sdk");
    const readline = require("readline");

    const audioFile = "OutputAudio.wav";
    const speechConfig = sdk.SpeechConfig.fromSubscription(process.env.SPEECH_KEY, process.env.SPEECH_REGION);
    const audioConfig = sdk.AudioConfig.fromAudioFileOutput(audioFile);
    speechConfig.speechSynthesisVoiceName = "en-US-AvaMultilingualNeural"; 

    const synthesizer = new sdk.SpeechSynthesizer(speechConfig, audioConfig);

    const rl = readline.createInterface({
        input: process.stdin,
        output: process.stdout
    });

    rl.question("Enter some text that you want to convert to speech:\n> ", function(text) {
        rl.close();

        synthesizer.speakTextAsync(text,
            function(result) {
                if (result.reason === sdk.ResultReason.SynthesizingAudioCompleted) {
                    console.log("Text-to-speech synthesis complete. Audio saved to:", audioFile);
                } else {
                    console.error("Error synthesizing speech:", result.errorDetails);
                }
                synthesizer.close();
            },
            function(err) {
                console.trace("Error:", err);
                synthesizer.close();
            });

        console.log("Now synthesizing to:", audioFile);
    });
}());



Upvotes: 0
