unhammer

Reputation: 4740

webrtc vad for finding start of (possibly short) utterance

We'd like to know when in an audio file an utterance starts. The utterance can be a whole sentence or quite short, e.g. a single word. There may be some background noise (breathing, creaking, fans etc.). Currently we're using a simple threshold method (if there's a fairly high-volume sound, we assume the user started speaking), but that sometimes fails when there's a loud enough noise.

We've been experimenting with webrtc-vad (hs, js), but it seems to give 1/True ("is voice") answers just as often to noise as to voice.

In example code using webrtc-vad, I see they often look for sequences of 1-answers in a row or within a time span, as in Mozilla's webrtcvad_js example code, but doing this doesn't seem to help us much. Continuously writing out the answer while testing is illuminating: the first series of 1's here is from me saying "i", and the second is from me carefully putting my coffee cup on the table:

00000000000000000001111111000000000000000000001111111100000000000000

The sequences are about the same length :( Playing with aggressiveness only seems to make it slightly worse.
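For reference, the "look for a run of 1-answers" idea can be sketched as below. This is only an illustration in Python, not the code from any of the webrtc-vad examples: `voiced_runs` and `min_len` are hypothetical names, and the trace is built by hand to resemble the one above (the run lengths are approximate).

```python
from itertools import groupby


def voiced_runs(flags, min_len=5):
    """Return (start_frame, length) for each run of 1s at least min_len long.

    flags is a sequence of per-frame VAD answers, given either as a string
    of "0"/"1" characters or as a list of 0/1 integers.
    """
    runs = []
    pos = 0
    for value, group in groupby(flags):
        length = sum(1 for _ in group)
        if value in ("1", 1) and length >= min_len:
            runs.append((pos, length))
        pos += length
    return runs


# Hand-built trace resembling the one above (hypothetical run lengths).
trace = "0" * 19 + "1" * 7 + "0" * 20 + "1" * 8 + "0" * 14
print(voiced_runs(trace))  # [(19, 7), (46, 8)]
```

Both runs pass the length check, so the spoken "i" and the coffee cup are indistinguishable by run length alone.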

Is webrtc-vad simply not suitable for this task? Or could it still be useful as a first filter? Maybe a second filter should check that most of the sound is in the 50–300 Hz range? (I know that I could send it through a full speech-to-text pipeline and see if that manages to turn it into something legible, but that seems rather overkill for just finding out when someone starts speaking …)
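A minimal sketch of what such a second filter could look like, assuming mono frames of float samples and using a plain DFT so that it needs nothing beyond the standard library. The function name `band_energy_ratio` and its defaults are made up for illustration, and the 50–300 Hz band is just the range mentioned above. Speech also carries a lot of energy above 300 Hz, so this is a crude heuristic at best.

```python
import cmath
import math


def band_energy_ratio(samples, sample_rate, low_hz=50.0, high_hz=300.0):
    """Fraction of a frame's spectral energy (DC excluded) in [low_hz, high_hz]."""
    n = len(samples)
    total = 0.0
    in_band = 0.0
    # Naive DFT over the positive-frequency bins; fine for short frames.
    for k in range(1, n // 2 + 1):
        freq = k * sample_rate / n
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * i / n)
            for i, x in enumerate(samples)
        )
        power = abs(coeff) ** 2
        total += power
        if low_hz <= freq <= high_hz:
            in_band += power
    return in_band / total if total > 0 else 0.0


# 30 ms test frames at 8 kHz: a 200 Hz tone and a 2000 Hz tone.
rate, n = 8000, 240
low_tone = [math.sin(2 * math.pi * 200 * i / rate) for i in range(n)]
high_tone = [math.sin(2 * math.pi * 2000 * i / rate) for i in range(n)]
print(band_energy_ratio(low_tone, rate) > 0.5)   # True
print(band_energy_ratio(high_tone, rate) > 0.5)  # False
```

A frame would only count as speech if webrtc-vad says 1 and the ratio is above some threshold; the 0.5 used here is arbitrary and would need tuning on real recordings.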

Upvotes: 2

Views: 1134

Answers (1)

Nikolay Shmyrev

Reputation: 25220

There are more advanced VADs that use machine learning, and they will perform better. For example:

https://github.com/jtkim-kaist/VAD

"I know that I could send it through a full speech-to-text pipeline and see if that manages to turn it into something legible, but that seems rather overkill for just finding out when someone starts speaking"

No, it is not overkill; it is actually the right thing to do. It also helps the recognizer estimate the noise properly, which gives better accuracy.

Upvotes: 3
