Yan Anisimov

Reputation: 39

How to reduce speech recognition time in CMU Sphinx?

I want to add speech recognition to an Asterisk server, and I would like to try an offline solution based on CMU Sphinx. But it works very slowly: recognition against a simple dictionary (yes|no|normal) takes about 20 seconds. I use this command:

pocketsphinx_continuous \
    -samprate 8000 \
    -dict my.dic \
    -lm ru.lm \
    -hmm zero_ru.cd_cont_4000 \
    -maxhmmpf 3000 \
    -maxwpf 5 \
    -topn 2 \
    -ds 2 \
    -logfn log.log \
    -remove_noise no \
    -infile 1.wav

Is it possible to reduce the time to 1-2 seconds, or must I look at an online solution (Google, Yandex, etc.)?

Upvotes: 0

Views: 1094

Answers (3)

Borja SIXTO

Reputation: 119

ASR and STT are two different things.

  • Automatic Speech Recognition matches the user's speech against a defined grammar (GRXML, JSGF, ABNF).
  • Speech To Text converts arbitrary speech into text (sometimes with errors).

In the case of PocketSphinx, you can use server mode and connect over MRCP (check the UniMRCP project). It is more efficient to start the server once and connect with one or more MRCP clients than to load the data and the engine for every recognition.

Upvotes: 0

Nikolay Shmyrev

Reputation: 25220

There are a number of mistakes in your attempt:

  • You use a continuous model, which is slow. It is better to use a PTM model.
  • You use a language model where a simple grammar would do.
  • You run a command to recognize one short file, so most of the time is spent reading the model. You need to use a server with the model preloaded instead. A UniMRCP server can process this request in 1/100 of a second.
  • You removed words from the dictionary, while you should keep it as is. Restrict the words in the language model/grammar instead.

The proper command would be:
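The "preload the model" point can be sketched in a few lines. This is only an illustration of the load-once pattern, not UniMRCP or PocketSphinx code: `PreloadedRecognizer`, `FakeDecoder`, and the `recognize` method are hypothetical names, and the fake decoder just sleeps to imitate the cost of reading a model from disk.

```python
import time


class PreloadedRecognizer:
    """Load an expensive decoder once and reuse it for every request."""

    def __init__(self, load_decoder):
        # load_decoder: hypothetical zero-argument callable returning an
        # object with a recognize(audio_bytes) method.
        self._load_decoder = load_decoder
        self._decoder = None
        self.load_count = 0

    def recognize(self, audio_bytes):
        # Pay the model-loading cost only on the first request.
        if self._decoder is None:
            self._decoder = self._load_decoder()
            self.load_count += 1
        return self._decoder.recognize(audio_bytes)


class FakeDecoder:
    """Stand-in for a real decoder; the constructor imitates model loading."""

    def __init__(self, load_seconds=0.05):
        time.sleep(load_seconds)

    def recognize(self, audio_bytes):
        # Placeholder result, not real recognition.
        return "да" if audio_bytes else ""


if __name__ == "__main__":
    recognizer = PreloadedRecognizer(FakeDecoder)
    for chunk in (b"first", b"second", b"third"):
        start = time.perf_counter()
        recognizer.recognize(chunk)
        print(f"{time.perf_counter() - start:.3f}s")
```

Only the first call pays the loading delay. A one-shot command-line run pays it on every file, which is why a long-running server process is so much faster per request.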

pocketsphinx_continuous \
    -samprate 8000 \
    -dict ru.dic \
    -jsgf my.jsgf \
    -hmm zero_ru.cd_ptm_4000 \
    -infile 1.wav

JSGF should look like this:

#JSGF V1.0;

grammar result;

public <result> = да | нет | нормально;
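If the list of allowed answers changes often, a grammar of this shape can be generated rather than written by hand. A minimal sketch, assuming a flat list of alternatives is enough; `build_jsgf` is a hypothetical helper, not part of PocketSphinx:

```python
def build_jsgf(grammar_name, rule_name, phrases):
    """Return JSGF text with one public rule listing the given alternatives."""
    if not phrases:
        raise ValueError("at least one phrase is required")
    alternatives = " | ".join(phrases)
    return (
        "#JSGF V1.0;\n\n"
        f"grammar {grammar_name};\n\n"
        f"public <{rule_name}> = {alternatives};\n"
    )


if __name__ == "__main__":
    text = build_jsgf("result", "result", ["да", "нет", "нормально"])
    # Write as UTF-8 so the Cyrillic words survive; my.jsgf matches the
    # file name used in the command above.
    with open("my.jsgf", "w", encoding="utf-8") as f:
        f.write(text)
```

Every word used in the grammar still has to be present in the pronunciation dictionary (ru.dic here).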

The total time to run the command is:

real    0m0.822s
user    0m0.789s
sys     0m0.028s

The actual recognition takes 0.02 seconds:

INFO: fsg_search.c(265): TOTAL fsg 0.02 CPU 0.006 xRT
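To read such a line: xRT is the real-time factor, which I take here to mean CPU time divided by audio duration (an assumption about how the decoder reports it). Under that reading, the two numbers also give the approximate length of the audio. A small sketch that parses the line; `parse_fsg_total` is a hypothetical helper:

```python
import re


def parse_fsg_total(line):
    """Extract CPU seconds and xRT from a 'TOTAL fsg ... CPU ... xRT' line.

    Returns (cpu_seconds, xrt, audio_seconds), assuming
    xRT = CPU time / audio duration.
    """
    match = re.search(r"TOTAL fsg ([\d.]+) CPU ([\d.]+) xRT", line)
    if match is None:
        raise ValueError("not a TOTAL fsg line")
    cpu_seconds = float(match.group(1))
    xrt = float(match.group(2))
    return cpu_seconds, xrt, cpu_seconds / xrt


if __name__ == "__main__":
    log_line = "INFO: fsg_search.c(265): TOTAL fsg 0.02 CPU 0.006 xRT"
    cpu, xrt, audio = parse_fsg_total(log_line)
    print(f"CPU {cpu}s, {xrt} xRT, roughly {audio:.1f}s of audio")
```

For the line above this gives roughly 3.3 seconds of audio, and the logged values are rounded, so treat it as an estimate. Recognition itself is far faster than real time, and the rest of the 0.8 seconds is startup and model loading.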

Upvotes: 2

arheops

Reputation: 15259

For reference, the Google Cloud solution takes 2.5-3.5 seconds for a 0-5 second recording.

The only faster option I know of is the gRPC (real-time streaming) version of Google Cloud, which takes about 1 second after the end of the word.

Speech recognition is a very CPU-intensive task. You can decrease recognition time by using a faster CPU or by using a speech context with only a few words. But it is really unlikely that you will get 10x faster recognition.

Upvotes: 1
