Incorporating Lmtool into PocketSphinx?

Question

I am trying to create a simple way to add new keywords to PocketSphinx. The idea is to have a temporary text file from which a script automatically adds a word (or phrase) to corpus.txt, dictionary.dic and language_model.lm.

Currently the best way to do this seems to be to use lmtool and then replace the aforementioned files with the updated versions. However, this presents three problems:

Lmtool is slow for large corpora, so the process will become progressively slower as more words are added.
Lmtool requires a semi-reliable internet connection to work and I'd like to be able to add commands while offline.
This is not the most efficient way to add commands, and won't work with the setup I'm putting together.

What I'd like to do, if possible, is use or create an offline version of lmtool that takes its input from a temporary text file (input.txt), processes it, and writes the results to three temporary text files (dic.txt, lm.txt, corp.txt).

The last step would be to run a script that will:

Take the output in corp.txt and add it to the end of corpus.txt.
Look through dictionary.dic and add any new words in dic.txt.
Somehow modify language_model.lm to include the new terms in lm.txt.
Erase the contents of the three output files.
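
The steps above could be sketched roughly as follows. This is a minimal sketch, not a tested PocketSphinx workflow: the function names are hypothetical, all files are assumed to sit in one working directory, and each dictionary line is assumed to start with the word followed by its phonemes. It covers the corpus and dictionary steps only. An ARPA language model stores probabilities estimated over the whole corpus, so rather than splicing lm.txt into language_model.lm it is usually simpler to rebuild the model from the updated corpus.txt.

```python
from pathlib import Path


def append_new_lines(target, source, key):
    """Append lines from source to target unless their key is already present."""
    existing = target.read_text() if target.exists() else ""
    seen = {key(line) for line in existing.splitlines() if line.strip()}
    added = []
    for line in source.read_text().splitlines():
        line = line.strip()
        if line and key(line) not in seen:
            seen.add(key(line))
            added.append(line)
    if added:
        # Guard against a target file that lacks a trailing newline.
        prefix = "" if existing == "" or existing.endswith("\n") else "\n"
        with target.open("a") as handle:
            handle.write(prefix + "\n".join(added) + "\n")
    return len(added)


def merge_new_entries(workdir):
    """Fold corp.txt and dic.txt into corpus.txt and dictionary.dic."""
    work = Path(workdir)
    # Corpus: a phrase is a duplicate if the whole line matches.
    phrases = append_new_lines(
        work / "corpus.txt", work / "corp.txt", key=lambda line: line.strip()
    )
    # Dictionary: a word is a duplicate if the first token matches (assumption).
    words = append_new_lines(
        work / "dictionary.dic", work / "dic.txt", key=lambda line: line.split()[0]
    )
    # Erase the temporary files; lm.txt is not merged, only cleared.
    for name in ("corp.txt", "dic.txt", "lm.txt"):
        if (work / name).exists():
            (work / name).write_text("")
    return phrases, words
```

Calling merge_new_entries(".") returns the number of phrases and words that were actually added, which makes it easy to skip the language model rebuild when nothing changed.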

My question is whether it is possible to get an offline version of lmtool that can write its results to specific text files. I know it is possible to automate lmtool (according to their site), but I would like to be able to run the process offline.

Also, has anyone attempted something like this before that I can use as a guide?

I am running PocketSphinx on a Raspberry Pi and I am aware that it will likely not be able to run lmtool on its own. My plan is to have lmtool run on a local server and sync files with the Pi via Wi-Fi/Ethernet.

Any help would be appreciated.

g10dras · Accepted Answer

You have a few choices if you want to generate the dictionary and language model locally on a Raspberry Pi (Model 2B at least).

For Language Model generation, you can use either

CMUCLMTK or
SRILM (SRI Language Modeling Toolkit)

To compile SRILM on Raspbian you need to tweak some files. Take a look here https://github.com/G10DRAS/SRILM-on-RaspberryPi
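
As a rough illustration of the CMUCLMTK route, the sketch below builds the usual chain of commands (text2wfreq, wfreq2vocab, text2idngram, idngram2lm) as shell strings. The command names and flags follow the CMUSphinx language model tutorial as I remember it, so check them against your installed version. The function name, file names and the run flag are assumptions for illustration, and CMUCLMTK normally expects each corpus sentence to be wrapped in <s> and </s>.

```python
import shlex
import subprocess


def cmuclmtk_commands(corpus, basename):
    """Return the shell commands that turn a corpus into an ARPA language model."""
    corpus = shlex.quote(corpus)
    vocab = shlex.quote(basename + ".vocab")
    idngram = shlex.quote(basename + ".idngram")
    arpa = shlex.quote(basename + ".lm")
    return [
        # 1. Count word frequencies and derive the vocabulary.
        f"text2wfreq < {corpus} | wfreq2vocab > {vocab}",
        # 2. Convert the corpus into id n-grams using that vocabulary.
        f"text2idngram -vocab {vocab} -idngram {idngram} < {corpus}",
        # 3. Estimate the model and write it in ARPA format.
        f"idngram2lm -vocab_type 0 -idngram {idngram} -vocab {vocab} -arpa {arpa}",
    ]


def build_language_model(corpus, basename, run=False):
    """Print the commands, and execute them only when run=True."""
    commands = cmuclmtk_commands(corpus, basename)
    for command in commands:
        print(command)
        if run:
            # Requires the CMUCLMTK binaries to be on PATH.
            subprocess.run(command, shell=True, check=True)
    return commands
```

With SRILM the same corpus can be handed to a single command of the form ngram-count -order 3 -text corpus.txt -lm language_model.lm. Either way the whole model is regenerated from corpus.txt each time, so no manual editing of the .lm file is needed.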

For dictionary generation, you can use either

Phonetisaurus, with the G2P model available here (or you can generate the FST yourself using phonetisaurus-cmudict-split), or
g2p-seq2seq (Sequence-to-Sequence G2P toolkit)

g2p-seq2seq is based on TensorFlow, which is not officially supported on the Raspberry Pi. For more details, see Installing TensorFlow on Raspberry Pi 3.
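
Whichever G2P tool you pick, it only needs to be run on words that the dictionary does not already contain. The sketch below shows one way to find them. It is an illustration under assumptions: missing_words is a hypothetical helper, dictionary lines are assumed to start with the word, and matching is case-insensitive because dictionary files differ in their casing.

```python
def missing_words(corpus_lines, dictionary_lines):
    """Return corpus words with no dictionary entry, in first-seen order."""
    # The first token of each dictionary line is taken to be the word.
    known = {line.split()[0].lower() for line in dictionary_lines if line.strip()}
    missing = []
    seen = set()
    for line in corpus_lines:
        for token in line.split():
            word = token.lower()
            # Skip sentence markers, which are not real words.
            if word in ("<s>", "</s>"):
                continue
            if word not in known and word not in seen:
                seen.add(word)
                missing.append(word)
    return missing


corpus = ["<s> turn on the lamp </s>", "<s> turn off the fan </s>"]
dictionary = ["turn T ER N", "on AA N", "the DH AH", "off AO F"]
print(missing_words(corpus, dictionary))
```

The resulting list can be written to a word list file and passed to Phonetisaurus or g2p-seq2seq, and the pronunciations they produce can then be appended to dictionary.dic.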

For more details (usage, how to compile, etc.), please go through the documentation of the respective toolkit.
