Reputation: 11
I'm trying to set up the WebRTC Voice Activity Detector (VAD) for a VoIP call that is streaming over a WebSocket, to detect when the caller has stopped talking.
Most of the tutorials and questions about WebRTC VAD are based on recorded audio files, not on a live stream. I would like to know how to implement it on a WebSocket streaming a VoIP call in real time.
According to the py-webrtcvad documentation (https://pypi.org/project/webrtcvad/):
Give it a short segment (“frame”) of audio. The WebRTC VAD only accepts 16-bit mono PCM audio, sampled at 8000, 16000, or 32000 Hz. A frame must be either 10, 20, or 30 ms in duration.
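In code terms, that requirement looks roughly like this. A minimal sketch of the py-webrtcvad API from the quote above; the silent placeholder frame is my own stand-in for real audio:

```python
import webrtcvad

vad = webrtcvad.Vad(2)  # aggressiveness mode 0-3; higher filters more non-speech

sample_rate = 16000                                   # must be 8000, 16000, or 32000 Hz
frame_ms = 30                                         # must be 10, 20, or 30 ms
frame_bytes = int(sample_rate * frame_ms / 1000) * 2  # 16-bit mono PCM = 2 bytes/sample -> 960 bytes

frame = b"\x00" * frame_bytes                         # placeholder: one silent frame
print(vad.is_speech(frame, sample_rate))              # False for silence
```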
How would one go about converting a raw audio stream coming from the websocket to the requirements needed above by the WebRTC VAD to work?
Upvotes: 1
Views: 2856
Reputation: 21
To use VAD, you need to feed it frames (chunks) of the correct size; see https://github.com/wiseman/py-webrtcvad/issues/30.
- For example, if your sample rate is 16000 Hz, then the only allowed frame/chunk sizes are 16000 * ({10, 20, 30} / 1000) = 160, 320, or 480 samples.
- Since each sample is 2 bytes (16 bits), the only allowed frame/chunk sizes are 320, 640, or 960 bytes.

A sketch of putting that together for a live stream is below.
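This is a hedged sketch of buffering arbitrary-sized WebSocket messages into fixed-size frames and feeding them to the VAD. It assumes the socket delivers raw 16-bit mono PCM at 16000 Hz; many telephony streams instead send 8 kHz μ-law, which you would need to decode first (the standard-library audioop.ulaw2lin can do this, though audioop was removed in Python 3.13). The StreamVAD class and the 500 ms silence threshold are my own illustration, not part of py-webrtcvad:

```python
import webrtcvad

SAMPLE_RATE = 16000                                   # assumed stream rate (Hz)
FRAME_MS = 20                                         # 10, 20, or 30 ms only
FRAME_BYTES = int(SAMPLE_RATE * FRAME_MS / 1000) * 2  # 320 samples * 2 bytes = 640
SILENCE_FRAMES = 25                                   # 25 * 20 ms = 500 ms of silence

vad = webrtcvad.Vad(2)                                # aggressiveness 0-3

class StreamVAD:
    """Buffers arbitrary-sized chunks and classifies fixed-size frames."""
    def __init__(self):
        self.buffer = b""
        self.silent_frames = 0

    def feed(self, chunk: bytes):
        self.buffer += chunk
        # Slice off as many complete frames as the buffer now holds;
        # any leftover bytes wait for the next WebSocket message.
        while len(self.buffer) >= FRAME_BYTES:
            frame = self.buffer[:FRAME_BYTES]
            self.buffer = self.buffer[FRAME_BYTES:]
            if vad.is_speech(frame, SAMPLE_RATE):
                self.silent_frames = 0
            else:
                self.silent_frames += 1
                if self.silent_frames == SILENCE_FRAMES:
                    print("caller stopped talking")

# Usage inside an async WebSocket handler (e.g. with the `websockets` library):
#
#     stream = StreamVAD()
#     async for message in ws:   # each message is a binary audio chunk
#         stream.feed(message)
```

The key point is that the VAD does not care how the bytes arrive: you accumulate whatever chunk sizes the WebSocket gives you and only ever hand the VAD exact 320/640/960-byte slices.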
Upvotes: 2