Reputation: 11
I am making a simple server for Piper TTS using FastAPI. One of the endpoints streams the audio as it is being generated. However, I quickly found out that my current approach, which is borrowed from AllTalk TTS, creates improper WAV files. These files fail to play in Firefox and even crash Audacity, although they can be played with Chrome, VLC, and MPV, with warnings.
Here is a cut-down version of my current approach:
import io
import wave

def stream_wav():
    # Send a WAV header first, then the raw PCM frames as they are produced.
    wav_buf = io.BytesIO()
    with wave.open(wav_buf, "wb") as wav_bytes:
        wav_bytes.setnchannels(1)
        wav_bytes.setsampwidth(2)
        wav_bytes.setframerate(model_metadata.sample_rate)
        wav_bytes.writeframes(b"")  # no frames yet; this only emits the header
    wav_buf.seek(0)
    yield wav_buf.read()
    for chunk in synthesize_stream_raw(...):  # each chunk is raw PCM bytes
        yield chunk
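For context, this generator is returned from the endpoint via FastAPI's StreamingResponse, roughly like this (a sketch; the route path and media type are my choices):

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.get("/tts/stream")
def tts_stream():
    # StreamingResponse iterates the generator and sends each yielded
    # chunk to the client as it becomes available.
    return StreamingResponse(stream_wav(), media_type="audio/wav")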
However, it seems that WAV expects a 4-byte field giving the size of the raw audio data before the data itself (wiki). Since the header is sent before the actual audio is generated, and there is no way to know the size beforehand, this approach produces a malformed WAV file: the data-size field stays at zero.
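To make the failure concrete, here is a minimal check (self-contained; the offsets assume the canonical 44-byte PCM header that Python's wave module writes):

import io
import struct
import wave

buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(22050)  # arbitrary rate; no frames are written
    w.writeframes(b"")

header = buf.getvalue()
# The RIFF chunk size sits at offset 4 and the data chunk size at
# offset 40; both reflect the zero frames written so far.
print(struct.unpack("<I", header[4:8])[0])    # 36 -> header remainder only
print(struct.unpack("<I", header[40:44])[0])  # 0  -> no audio data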
An old thread suggested setting the size fields to their maximum value, but I think that would create another problem. Currently, I just generate the complete file in memory and then send it, but that isn't really streaming while the audio is being generated.
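For concreteness, my understanding of that suggestion is a hand-built header like the sketch below (the helper name and defaults are mine, not from the thread):

import struct

def streaming_wav_header(sample_rate: int, channels: int = 1,
                         sample_width: int = 2) -> bytes:
    # Hypothetical helper: canonical 44-byte PCM header with both size
    # fields set to 0xFFFFFFFF because the stream length is unknown.
    byte_rate = sample_rate * channels * sample_width
    block_align = channels * sample_width
    return b"".join([
        b"RIFF",
        struct.pack("<I", 0xFFFFFFFF),        # RIFF chunk size: unknown
        b"WAVE",
        b"fmt ",
        struct.pack("<I", 16),                # fmt chunk size (PCM)
        struct.pack("<H", 1),                 # format tag 1 = PCM
        struct.pack("<H", channels),
        struct.pack("<I", sample_rate),
        struct.pack("<I", byte_rate),
        struct.pack("<H", block_align),
        struct.pack("<H", sample_width * 8),  # bits per sample
        b"data",
        struct.pack("<I", 0xFFFFFFFF),        # data chunk size: unknown
    ])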
What approach should I take here? Should I choose a different audio format?
Upvotes: 1
Views: 65