Lukasz

Reputation: 19916

ARPA language model documentation

Where can I find documentation on ARPA language model format?

I am developing a simple speech recognition app with the pocket-sphinx STT engine. ARPA is recommended there for performance reasons. I want to understand how much I can do to adjust my language model to my custom needs.

All I have found are some very brief ARPA format descriptions:

I am a beginner in STT and I have trouble wrapping my head around this (n-grams, etc.). I am looking for more detailed docs, something like the documentation on JSGF grammar here:

http://www.w3.org/TR/jsgf/

Upvotes: 18

Views: 11491

Answers (3)

0x5050

Reputation: 1231

I'm probably very late to answer this, but I found the ARPA LM format well documented in this link from The HTK Book by Steve Young et al.

Each n-gram line of an ARPA file is a triple that stores:

the n-gram log-probability (base 10) ; the n-gram itself ; the back-off weight (also in log space, and omitted for n-grams of the highest order).
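
To make the triple concrete, here is a minimal sketch of a reader for that layout. It is not how pocket-sphinx loads models; the function name `parse_arpa` and the tiny `SAMPLE` model (with made-up numbers) are my own, and I'm assuming the usual `\data\`, `\N-grams:` and `\end\` section markers with whitespace-separated fields.

```python
import io


def parse_arpa(lines):
    """Parse ARPA text into {order: {word_tuple: (log10_prob, log10_backoff)}}.

    Hypothetical helper for illustration; the back-off weight is None
    when the line does not carry one.
    """
    model = {}
    order = None
    for raw in lines:
        line = raw.strip()
        # Skip blank lines and the \data\ header with its "ngram N=count" lines.
        if not line or line == "\\data\\" or line.startswith("ngram "):
            continue
        if line == "\\end\\":
            break
        # A section marker such as \2-grams: switches the current order.
        if line.startswith("\\") and line.endswith("-grams:"):
            order = int(line[1:line.index("-")])
            model[order] = {}
            continue
        if order is None:
            continue  # free text before the first section
        fields = line.split()
        logprob = float(fields[0])
        words = tuple(fields[1:1 + order])
        # The third element of the triple is optional.
        backoff = float(fields[1 + order]) if len(fields) > 1 + order else None
        model[order][words] = (logprob, backoff)
    return model


# Made-up toy model, only to exercise the parser.
SAMPLE = """\\data\\
ngram 1=4
ngram 2=2

\\1-grams:
-0.7782 <s> -0.3010
-0.7782 </s>
-0.4771 hello -0.3010
-0.4771 world -0.3010

\\2-grams:
-0.3010 <s> hello
-0.3010 hello world

\\end\\
"""

model = parse_arpa(io.StringIO(SAMPLE))
print(model[1][("hello",)])
print(model[2][("hello", "world")])
```

Keying each section by a tuple of words keeps unigrams and bigrams in the same shape, which makes later lookups by history straightforward.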

Upvotes: 2

needsaname

Reputation: 73

You can complement those docs with this tech report, which gives a comprehensive overview of smoothing for language modeling: http://www.ee.columbia.edu/~stanchen/papers/h015a-techreport.pdf

It also gives definitions for backoff models and interpolated models.
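
As a rough illustration of what "backoff" means in practice, here is a sketch of how a back-off model is typically queried: if the full n-gram is listed, its log-probability is used; otherwise the back-off weight of the history is added to the score of the shortened n-gram. The `TOY_MODEL` values and the `log10_prob` helper are made up for this example and are not taken from the report or from any toolkit.

```python
# Toy model: word tuple -> (log10 probability, log10 back-off weight).
# All numbers are invented for illustration.
TOY_MODEL = {
    ("hello",): (-0.5, -0.3),
    ("world",): (-0.7, -0.2),
    ("there",): (-1.0, 0.0),
    ("hello", "world"): (-0.2, 0.0),
}


def log10_prob(model, history, word):
    """Return log10 P(word | history) using simple back-off (hypothetical helper)."""
    ngram = tuple(history) + (word,)
    if ngram in model:
        # The n-gram was seen: use its stored log-probability directly.
        return model[ngram][0]
    if not history:
        # Unseen unigram; real models usually reserve an <unk> entry instead.
        return float("-inf")
    # Back off: add the history's weight (0.0 if the history is not listed)
    # and retry with the oldest history word dropped.
    backoff = model.get(tuple(history), (0.0, 0.0))[1]
    return backoff + log10_prob(model, history[1:], word)


seen = log10_prob(TOY_MODEL, ("hello",), "world")    # bigram is listed
unseen = log10_prob(TOY_MODEL, ("hello",), "there")  # falls back to the unigram
print(seen, unseen)
```

Because everything is in log space, the back-off weight is added rather than multiplied.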

Upvotes: 2

Dariusz

Reputation: 22241

There is actually not much more to say about the format than is said in those docs.

Besides, you'll probably want to prepare a text file with sample sentences and generate the language model file from it. There is an online tool that can do it for you: lmtool

Upvotes: 4
