Riken Shah

Reputation: 3154

train a language model using Google Ngrams

I want to find the conditional probability of a word given its previous set of words, and I plan to use the Google N-grams corpus for this. However, the corpus is so huge that I don't think it is computationally feasible on my PC to process all the N-grams and train a language model.

So is there any way I can train a language model using Google N-grams? (Even the Python NLTK library no longer supports an n-gram language model.) Note: I know that a language model can be trained from n-grams in general; my question is how to train one specifically from the Google N-grams corpus, given its vast size.

Upvotes: 1

Views: 1228

Answers (1)

Dan Salo

Reputation: 163

You ought to check out this slick code base from UC Berkeley: https://github.com/adampauls/berkeleylm

In the examples/ folder, you will find a bash script make-binary-from-google.sh that creates a compact language model from the raw Google N-Grams. The resulting LM implements stupid backoff and utilizes a fast and efficient data structure described in the following paper: http://nlp.cs.berkeley.edu/pubs/Pauls-Klein_2011_LM_paper.pdf

If you are just interested in the final trained LM, you can download it in a variety of languages from the Berkeley-hosted website: http://tomato.banatao.berkeley.edu:8080/berkeleylm_binaries/
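Once you have a binary (either built with make-binary-from-google.sh or downloaded pre-trained), you can query it from Java for exactly the kind of conditional probability you ask about. The sketch below is a minimal example, not code from the repository: the file path is a placeholder, and the calls (LmReaders.readLmBinary, getLogProb) reflect my recollection of the berkeleylm API, so check the project's javadoc for the exact signatures.

    import java.util.Arrays;
    import java.util.List;

    import edu.berkeley.nlp.lm.NgramLanguageModel;
    import edu.berkeley.nlp.lm.io.LmReaders;

    public class QueryGoogleLm {
        public static void main(String[] args) {
            // Placeholder path: a binary produced by make-binary-from-google.sh
            // or downloaded from the Berkeley-hosted site.
            String binaryPath = "eng.blm.gz";

            // Load the compact binary language model into memory.
            NgramLanguageModel<String> lm = LmReaders.readLmBinary(binaryPath);

            // Log probability of the last word given the preceding words,
            // i.e. log P("model" | "a", "language"), under stupid backoff.
            List<String> ngram = Arrays.asList("a", "language", "model");
            float logProb = lm.getLogProb(ngram);
            System.out.println("log P(model | a language) = " + logProb);
        }
    }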

Upvotes: 2
