sehugg

Reputation: 3605

"Voice trigger" detection

I have a voice application that would be much improved if it could use a "trigger word" to start recording audio. I don't need a full speech-to-text engine, just the ability to detect the trigger word reliably and efficiently.

I am wondering if there are any specialized speech engines that support this specific use case, or any libraries/methods for developing such a single-purpose detection engine. Ideally I'd like it to work in noisy environments, but it can be trained for a single user's voice.

Pointers to research papers / topics would also be appreciated so I know what to ask for.

Upvotes: 4

Views: 3744

Answers (5)

Paul Gregoire

Reputation: 9793

A colleague of mine on the Red5 project created a similar demo using trigger words to cause a search to be run against an image repository. Saying "cat" caused an image of a cat to appear within about a second. The client application was written in Flash and the back-end ran on Red5 using the free Sphinx library. You could certainly do what you want with Sphinx without much effort.
Sphinx project: http://cmusphinx.sourceforge.net/sphinx4/

Upvotes: 2

reinaldo Crespo

Reputation: 123

I have a voice recording win32 app. I use an OCX to manage recording/playback.

I know it is not exactly the solution you are asking for, but you might want to consider a foot pedal. It is simple to program and would work much like a spoken word to start/stop recording. Check these: www.pedalpower.com

Hope it helps,

Reinaldo.

Upvotes: 0

hlovdal

Reputation: 28238

A question was asked just a few days ago about speech recognition possibilities on Linux. What you ask for is a subset of that, so I assume some of those answers contain useful information. The article linked in joeforker's answer was very interesting.

Upvotes: 0

ChrisW

Reputation: 56123

What O/S? I wonder, for example, whether the Speech functionality in Windows Vista would help you. Recognising a single word seems like the simplest possible problem for any speech analyzer.

Upvotes: 0

Nils Pipenbrinck

Reputation: 86393

Okay, I could be completely off, but using a full-featured speech recognition library may be overkill for your use case.

If you can live with something simpler but still audio-driven, consider this:

Detecting a hand-clap is very simple. A hand-clap has high energy across the whole audio band. Detecting it is simple and computationally much cheaper than full-blown speech recognition.

In a nutshell, you record the audio, do a (short-time) FFT on the data, and detect the case where you have high energy in 80% of the available frequency bins. Requiring 80% rather than all of them takes care of any phasing issues caused by a simple recording-room/microphone setup. Then adjust the threshold to taste and you're done.

Doing the same with speech recognition is possible as well, but you will burn tons of CPU cycles.
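
A minimal sketch of the clap detector described above, in plain Python. The function names (magnitude_spectrum, is_clap), the frame length, and the threshold value are illustrative assumptions, not part of any library. A naive DFT stands in for a real FFT to keep the example dependency-free; in practice you would use an FFT library and tune the threshold against your own microphone.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Return DFT magnitudes for bins 1..N/2 of one audio frame.

    A naive O(N^2) DFT is used here only to avoid dependencies;
    a real FFT gives the same values much faster. The DC bin (k=0)
    is skipped because it only reflects the signal's offset.
    """
    n = len(frame)
    mags = []
    for k in range(1, n // 2 + 1):
        acc = 0j
        for t, x in enumerate(frame):
            acc += x * cmath.exp(-2j * math.pi * k * t / n)
        mags.append(abs(acc))
    return mags


def is_clap(frame, threshold, fraction=0.8):
    """True if at least `fraction` of the bins exceed `threshold`.

    `threshold` is an assumed, hand-tuned value ("adjust to taste").
    """
    mags = magnitude_spectrum(frame)
    loud = sum(1 for m in mags if m > threshold)
    return loud >= fraction * len(mags)


# A single-sample click is broadband; a pure tone occupies one bin.
click = [0.0] * 64
click[10] = 1.0
tone = [math.sin(2 * math.pi * 4 * t / 64) for t in range(64)]

print(is_clap(click, threshold=0.5))
print(is_clap(tone, threshold=0.5))
```

The click spreads its energy over every bin, so it passes the 80% test, while the tone concentrates its energy in a single bin and fails it. Real recordings would be processed frame by frame, and the threshold would need to scale with frame length and input gain.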

Upvotes: 1
