Vaughn Hughes

Reputation: 1

Undesired Pause in Azure Speech Service Neural Voice Synthesis for Long Sentences

I am trying to use Azure Speech Service to synthesize sentences longer than 500 characters with a Neural Voice, without getting a pause in the speech 500 characters into the sentence.

The text I need to synthesize (using Neural Voices) contains a number of very long sentences. (And no, I can't change them, unfortunately.) I've noticed that around 25-28 seconds into such a sentence, and apparently always 500 characters in, the voice synthesizer pauses for about as long as it would at the end of a sentence. I can't have a pause like this. I cannot find anything in the documentation about a sentence length limit or about a way to avoid that pause (perhaps using some SSML tag?).

Can anyone provide some guidance on what can be done to avoid these gaps / pauses?

I know it's possible to edit the generated MP3 files, but my workflow just isn't suited to that: I have 40,000+ MP3 files, any one of which I may need to resynthesize as mispronunciations, etc. are found, and over 1200 of them contain these long sentences with gaps in synthesis. I really need to be able to synthesize the sentences into MP3 files without making manual fixes.

For those not familiar with how to synthesize speech from SSML using the Azure Speech Service SDK, the snippet of code I use is below. But again, it makes no difference whether you synthesize the SSML using this code or whether you use the Azure Speech Service portal. The resulting MP3 has the exact same problem.

        // Submit the SSML file as a batch synthesis job.
        batchClient = new BatchSynthesisClient(host, speechKey);
        var ssmlFileWithoutExtension = Path.GetFileNameWithoutExtension(ssmlFile);
        Console.WriteLine($"  Enqueuing {ssmlFileWithoutExtension} ({ssmlFile})");
        var ssml = File.ReadAllText(ssmlFile);
        var newSynthesisUri = await batchClient.CreateSynthesisAsync(
            voiceName,
            ssmlFileWithoutExtension,
            "enqueued " + DateTime.Now.ToString("M/d/yyyy h:mmtt"),
            ssml,
            true).ConfigureAwait(false);

        // The job ID is the last segment of the returned URI.
        var synthesisId = Guid.Parse(newSynthesisUri.Segments.Last());

        // Wait for the job to complete using logic to check for completion

        // Fetch the finished job and download its output (a zip file) to disk.
        var synthesis = await batchClient.GetSynthesisAsync(synthesisId).ConfigureAwait(false);
        var downloadUrl = synthesis.Outputs.Result;
        using (HttpResponseMessage response = await batchClient.HttpClient.GetAsync(downloadUrl))
        using (Stream streamToReadFrom = await response.Content.ReadAsStreamAsync())
        using (Stream streamToWriteTo = File.Open(zipFilePath, FileMode.Create))
        {
            await streamToReadFrom.CopyToAsync(streamToWriteTo);
        }

Upvotes: 0

Views: 370

Answers (1)

Vaughn Hughes

Reputation: 1

It turns out that this is a fundamental limitation of the Azure Speech Service, and it remains undocumented. 500 characters is the maximum sentence length it will synthesize before arbitrarily inserting a significant pause (about as long as the pause for a period at the end of a sentence). I'm hoping someone at Microsoft will document it, so the next person who comes along doesn't spend months and loads of money on this like I did.
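
Since the limit isn't documented, it may at least help to find the affected sentences before synthesizing. Below is a minimal sketch, not part of the Speech SDK: `LongSentenceFinder`, `FindLongSentences` and `PauseThreshold` are hypothetical names. It assumes plain text input (SSML tags stripped first) and a naive rule that a sentence ends at `.`, `!` or `?` followed by whitespace. How the service itself counts characters is unknown to me, so treat the 500 figure as the observed value only.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LongSentenceFinder
{
    // Observed (undocumented) sentence length at which the pause appears.
    public const int PauseThreshold = 500;

    // Returns every sentence in the text that is longer than the threshold.
    // Sentence splitting is deliberately naive: split on whitespace that
    // follows '.', '!' or '?'. Abbreviations like "e.g. " will split too early.
    public static List<string> FindLongSentences(string text, int threshold = PauseThreshold)
    {
        var longSentences = new List<string>();
        foreach (var piece in Regex.Split(text, @"(?<=[.!?])\s+"))
        {
            var sentence = piece.Trim();
            if (sentence.Length > threshold)
            {
                longSentences.Add(sentence);
            }
        }
        return longSentences;
    }
}
```

Running this over the source text of each file would give a list of sentences likely to get the pause, which can then be reviewed in one place instead of being discovered by listening to the MP3s.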

Upvotes: 0
