SkyTreasure

Reputation: 884

Is there a way to make Google Text-to-Speech speak text for a desired duration?

I went through the documentation of Google Text to Speech SSML. https://developers.google.com/assistant/actions/reference/ssml#prosody

There is a tag called <prosody> which, per the W3C specification, accepts a duration attribute: a value in seconds or milliseconds for the desired time to take to read the contained text.

So <speak><prosody duration='6s'>Hello, How are you?</prosody></speak> should take 6 seconds for Google Text-to-Speech to speak. But when I try it at https://cloud.google.com/text-to-speech/ it does not work, and it does not work through the REST API either.

Does Google Text-to-Speech ignore the duration attribute? If so, is there another way to achieve the same result?

Upvotes: 1

Views: 1484

Answers (2)

Rub

Reputation: 2738

As Mr Lister already mentioned, the documentation clearly says:

<prosody>

Used to customize the pitch, speaking rate, and volume of text contained by the element. Currently the rate, pitch, and volume attributes are supported.

The rate and volume attributes can be set according to the W3 specifications.

You can test it using the demo UI.


In particular you can use things like

rate="slow"

or

rate="80%"

to adjust the speed. However, that is as far as you can go with Google TTS.
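As a minimal sketch of how such a rate value could be placed into an SSML string, the helper below wraps text in a prosody element. The function name prosody_ssml is hypothetical, and the code only builds the string; it does not call Google TTS.

```python
from xml.sax.saxutils import escape, quoteattr


def prosody_ssml(text: str, rate: str) -> str:
    """Wrap text in <speak><prosody rate=...>, escaping XML special characters.

    `rate` is passed through unchanged, e.g. "slow" or "80%".
    """
    return f"<speak><prosody rate={quoteattr(rate)}>{escape(text)}</prosody></speak>"


if __name__ == "__main__":
    # The resulting string can be pasted into the demo UI or sent as SSML input.
    print(prosody_ssml("Hello, how are you?", "80%"))
```

Escaping matters because characters such as & or < in the spoken text would otherwise produce invalid SSML.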


Amazon Polly does support what you need, but only on Standard voices (not Neural).

Here is the documentation: Setting a Maximum Duration for Synthesized Speech

Polly also has a UI to do a quick test.


Upvotes: 1

Remi Zaidan

Reputation: 74

There are two ways I know of to solve this:

  • First option: call Google's API twice. Use the first call to measure the duration of the spoken audio, and the second call to adjust the rate parameter accordingly.

    • Pros: Possibly better audio quality (this is subjective and depends on taste as well as the application's requirements).
    • Cons: Doubles the cost and processing time.
  • Second option: post-process the audio using a specialized tool such as ffmpeg.

    • Pros: Cost-effective and can be fast if implemented correctly.
    • Cons: Requires some knowledge of audio post-processing concepts and of the tool itself (no need to become an expert, though).
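Both options come down to the same arithmetic: the speed-up factor is the measured duration divided by the target duration. The sketch below illustrates this under a few assumptions. The function names are hypothetical, the rate limits of 0.25 and 4.0 are assumed defaults that should be checked against the current Google API reference, and the 0.5 to 2.0 per-filter range for ffmpeg's atempo is a conservative choice that is chained when the factor falls outside it.

```python
def rate_for_target(measured_s: float, target_s: float,
                    min_rate: float = 0.25, max_rate: float = 4.0) -> float:
    """Option 1: speaking rate for the second API call.

    measured_s is the duration of the first synthesis at normal rate;
    the limits are assumed values, not confirmed API constants.
    """
    if measured_s <= 0 or target_s <= 0:
        raise ValueError("durations must be positive")
    return max(min_rate, min(max_rate, measured_s / target_s))


def atempo_chain(factor: float, lo: float = 0.5, hi: float = 2.0) -> str:
    """Option 2: build an ffmpeg audio filter string for a tempo factor.

    Factors outside [lo, hi] are split into several chained atempo filters
    whose product equals the requested factor.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    parts = []
    while factor > hi:
        parts.append(hi)
        factor /= hi
    while factor < lo:
        parts.append(lo)
        factor /= lo
    parts.append(factor)
    return ",".join(f"atempo={p:.6g}" for p in parts)


if __name__ == "__main__":
    # Audio measured at 3 s that should last 6 s needs half speed.
    print(rate_for_target(3.0, 6.0))
    print(atempo_chain(3.0 / 6.0))
```

The filter string would be used along the lines of ffmpeg -i in.wav -filter:a "atempo=0.5" out.wav. Because speech duration does not scale perfectly linearly with the rate setting (pauses may be handled differently), the first option may land slightly off the target, whereas the post-processing route scales the finished audio directly.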

Upvotes: 1
