Reputation: 494
I am developing a Persian text-to-speech (TTS) engine for the NVDA screen reader and am considering neural TTS models for real-time speech synthesis. I need advice on the following implementation aspects:
1. Real-time Performance of Neural TTS Models: I have been experimenting with models like FastSpeech2, but I am seeing roughly one second of latency for short utterances on CPU. Are there ways to optimize neural TTS models for real-time use (my current optimization attempt is sketched after this list), or is this delay inherent to the architecture?
2. Microsoft Speech API 5 in NVDA: NVDA can use Microsoft Speech API 5 (SAPI 5) voices, which deliver fast and fairly natural speech. Does anyone know whether these voices are based on neural TTS or concatenative synthesis, and how they maintain such low latency?
3. Sentence-Level vs. Word-Level Speech Generation: When implementing a screen reader, should I generate speech at the word level to minimize latency, or is sentence-level processing preferable for naturalness? How do existing screen readers balance this trade-off? (A streaming compromise I am considering is sketched after this list.)
4. Best Practices for TTS Integration in a Screen Reader: What are the best practices for integrating a neural TTS system into a screen reader with minimal latency, especially on CPU-only systems? (My current understanding of NVDA's driver interface is also sketched below.)
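For item 1, the CPU speedups I have been trying are ONNX export plus post-training dynamic quantization. Here is a minimal sketch of my current approach, assuming the model has already been exported to a file named fastspeech2.onnx (the file names are placeholders, and the real input tensor names depend on how the model was exported):

```python
import time
import numpy as np
import onnxruntime as ort
from onnxruntime.quantization import quantize_dynamic, QuantType

# Dynamic quantization: weights become int8, activations stay float.
# For transformer-style models such as FastSpeech2 this usually cuts
# CPU latency noticeably without retraining.
quantize_dynamic(
    "fastspeech2.onnx",       # placeholder: the exported acoustic model
    "fastspeech2.int8.onnx",  # quantized output model
    weight_type=QuantType.QInt8,
)

sess = ort.InferenceSession(
    "fastspeech2.int8.onnx",
    providers=["CPUExecutionProvider"],
)

# Placeholder phoneme-ID input; the actual input name(s) and shapes
# depend on the export -- inspect sess.get_inputs() to find them.
phoneme_ids = np.array([[12, 43, 7, 85, 3]], dtype=np.int64)

start = time.perf_counter()
outputs = sess.run(None, {sess.get_inputs()[0].name: phoneme_ids})
print(f"mel synthesis took {time.perf_counter() - start:.3f}s")
```

I also keep one warm InferenceSession per process (session creation is far slower than inference) and experiment with ort.SessionOptions().intra_op_num_threads, since too many threads can hurt short-utterance latency.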
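For item 3, instead of choosing strictly between word-level and sentence-level synthesis, I am considering a streaming compromise: split on sentence or clause boundaries, synthesize chunks in a worker thread, and start playback as soon as the first chunk is ready. A sketch, where synthesize() and play() are placeholders for the real model call and audio sink:

```python
import queue
import re
import threading

def split_chunks(text):
    # Naive sentence splitter; Persian text needs its own punctuation
    # set (e.g. '؟' and '،') -- this regex is just a placeholder.
    return [c.strip() for c in re.split(r"(?<=[.!?؟])\s+", text) if c.strip()]

def synthesize(chunk):
    # Placeholder for the actual TTS call (e.g. the ONNX session above
    # plus a vocoder); should return raw PCM audio bytes.
    raise NotImplementedError

def play(audio):
    # Placeholder for the audio sink (NVDA's audio layer, or any player).
    raise NotImplementedError

def speak_streaming(text):
    audio_q = queue.Queue(maxsize=2)  # holds audio chunks; None = done

    def producer():
        for chunk in split_chunks(text):
            audio_q.put(synthesize(chunk))  # synthesis overlaps playback
        audio_q.put(None)                   # end-of-stream sentinel

    threading.Thread(target=producer, daemon=True).start()
    while (audio := audio_q.get()) is not None:
        play(audio)  # first chunk plays while later chunks synthesize
```

With this layout the perceived latency is governed only by the first chunk, so I suspect word-level granularity buys little beyond the first clause while costing prosody.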
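For item 4, my understanding is that NVDA loads synthesizers from a module in its synthDrivers package exposing a class literally named SynthDriver, subclassing synthDriverHandler.SynthDriver. A bare skeleton of what I believe the shape is (method names taken from my reading of NVDA's developer documentation; corrections welcome from anyone who has shipped a driver):

```python
# synthDrivers/persianNeural.py -- hypothetical module name; NVDA
# discovers drivers by this location and the SynthDriver class name.
import synthDriverHandler

class SynthDriver(synthDriverHandler.SynthDriver):
    name = "persianNeural"              # internal identifier (my choice)
    description = "Persian neural TTS"  # shown in NVDA's synth dialog

    @classmethod
    def check(cls):
        # Should return True only if the model files/runtime are usable.
        return True

    def speak(self, speechSequence):
        # speechSequence mixes plain strings with command objects
        # (indexes, pitch changes, etc.); a real driver must handle both.
        text = "".join(item for item in speechSequence if isinstance(item, str))
        # Hand `text` to a background synthesis thread (e.g. the
        # streaming pipeline above) -- this method must not block.

    def cancel(self):
        # Called every time the user moves the cursor; must silence
        # audio immediately. This is the latency-critical path.
        pass

    def terminate(self):
        # Release the model/session when NVDA unloads the driver.
        pass
```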
Any advice, resources on optimizing the real-time performance of neural TTS models, or other implementation tips would be greatly appreciated.
Upvotes: 0
Views: 44