Reputation: 884
I went through the documentation of Google Text to Speech SSML. https://developers.google.com/assistant/actions/reference/ssml#prosody
So there is a tag called <Prosody/>
which as per the documentation of W3 Specification can accept an attribute called duration which is a value in seconds or milliseconds for the desired time to take to read the contained text.
So <speak><prosody duration='6s'>Hello, How are you?</prosody></speak>
should take 3 seconds for google text to speech to speak this! But when i try it here https://cloud.google.com/text-to-speech/ , its not working and also I tried it in rest API.
Does google text to speech doesn't take duration attribute into account? If they don't then is there a way to achieve the same?
Upvotes: 1
Views: 1484
Reputation: 2738
As Mr Lister already mentioned, the documentation clearly says.
<prosody>
Used to customize the pitch, speaking rate, and volume of text contained by the element. Currently the rate, pitch, and volume attributes are supported.
The rate and volume attributes can be set according to the W3 specifications.
Using the UI interface you can test it.
In particular you can use things like
rate="low"
or
rate="80%"
to adjust the speed. However that is as far as you can go with Google TTS.
AWS Polly does support what you need, but only on Standard voices (not Neural).
Here is the documentation. Setting a Maximum Duration for Synthesized Speech
Polly also has a UI to do a quick test.
Upvotes: 1
Reputation: 74
There are two ways I know of to solve this:
First Option: call Google's API twice: use the first call to measure the time of the spoken audio, and the second call to adjust the rate parameter accordingly.
Second option: Post-process the audio using a specialized library such as ffmpeg
Upvotes: 1