Reputation: 3605
I have a voice application that would be much improved if it could use a "trigger word" to start recording audio. I don't need a full speech-to-text engine, just the ability to detect the trigger word reliably and efficiently.
I am wondering if there are any specialized speech engines that support this specific use case, or any libraries or methods for developing such a single-purpose detection engine. Ideally I'd like it to work in noisy environments, but it can be trained for a single user's voice.
Pointers to research papers / topics would also be appreciated so I know what to ask for.
Upvotes: 4
Views: 3744
Reputation: 9793
A colleague of mine on the Red5 project created a similar demo using trigger words to cause a search to be run against an image repository. Saying "cat" caused an image of a cat to appear within about a second. The client application was written in Flash and the back-end ran on Red5 using the free Sphinx library. You could certainly do what you want with Sphinx without much effort.
Sphinx project: http://cmusphinx.sourceforge.net/sphinx4/
Upvotes: 2
Reputation: 123
I have a voice recording win32 app. I use an OCX to manage recording/playback.
I know it is not exactly the solution you are asking for, but you might want to consider a foot pedal. It is simple to program and would work much like a spoken word to start and stop recording. Check these: www.pedalpower.com
Hope it helps,
Reinaldo.
Upvotes: 0
Reputation: 28238
A question was asked just a few days ago about speech recognition possibilities on Linux. What you ask for is a subset of that, so I assume some of those answers contain useful information. The article linked in joeforker's answer was very interesting.
Upvotes: 0
Reputation: 56123
Which OS? I wonder, for example, whether the speech functionality in Windows Vista would help you. Recognising a single word seems like the simplest possible problem for any speech analyzer.
Upvotes: 0
Reputation: 86393
Okay, I could be completely off, but using a full-featured speech-recognition library may be overkill for your use case.
If you can live with something simpler but still audio-driven, consider this:
Detecting a hand-clap is very simple. A hand-clap has high energy across the whole audio band. Detecting it is much cheaper computationally than full-blown speech recognition.
In a nutshell, you record the audio, do a (short-time) FFT on the data and detect the case where you have high energy in 80% of the available frequency bins. Requiring 80% rather than all of the bins takes care of any phasing issues due to a simple recording-room/microphone setup. Then adjust the threshold to taste and you're done.
Doing the same with speech recognition is possible as well, but you will burn tons of CPU cycles.
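Here is a rough sketch of the clap detector described above, in plain Python so it runs without extra packages. The function names, the frame size of 256 samples and the threshold values are my own placeholders, not something taken from a particular library. The small recursive FFT is only there to keep the example self-contained; in a real app you would probably use an optimized FFT routine instead.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def is_clap(frame, energy_threshold, bin_fraction=0.8):
    """True if at least bin_fraction of the frequency bins exceed the threshold."""
    spectrum = fft(frame)
    half = len(frame) // 2
    # Skip bin 0 (DC offset) and keep only the positive-frequency bins.
    energies = [abs(c) ** 2 for c in spectrum[1:half]]
    loud = sum(1 for e in energies if e > energy_threshold)
    return loud >= bin_fraction * len(energies)


def find_claps(samples, frame_size=256, energy_threshold=1.0, bin_fraction=0.8):
    """Return the indices of non-overlapping frames that look like a clap."""
    hits = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        if is_clap(frame, energy_threshold, bin_fraction):
            hits.append(start // frame_size)
    return hits


if __name__ == "__main__":
    # Synthetic input: a pure tone (narrow-band), an impulse (broadband), silence.
    tone = [math.sin(2 * math.pi * 8 * n / 256) for n in range(256)]
    impulse = [0.0] * 256
    impulse[10] = 1.0
    silence = [0.0] * 256
    print(find_claps(tone + impulse + silence, energy_threshold=0.5))
```

The pure tone is loud but only in one bin, so it is rejected, while the impulse spreads its energy over every bin and is flagged. With a real microphone you would feed in successive frames from the recording buffer and tune `energy_threshold` and `bin_fraction` by ear.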
Upvotes: 1