Reputation: 340
I'm studying compiler construction using Python. I'm trying to create a list of all lowercased words in a text and then build a BigramCollocationFinder, which we can use to find bigrams, which are pairs of words.
These bigrams are found using association measurement functions in the nltk.metrics package.
I'm practising from the "Python 3 Text Processing with NLTK 3 Cookbook" and I found this example code:
from nltk.corpus import webtext
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
words = [w.lower() for w in webtext.words('grail.txt')]
bcf = BigramCollocationFinder.from_words(words)
bcf.nbest(BigramAssocMeasures.likelihood_ratio, 4)
I'm stuck at:
bcf.nbest(BigramAssocMeasures.likelihood_ratio, 4)
What do the arguments likelihood_ratio and 4 mean here? Is likelihood_ratio a similarity ratio, or does it mean something else in this code?
Any guidance in this matter would be highly appreciated.
Upvotes: 1
Views: 4541
Reputation: 209
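I believe the question "NLTK collocations for specific words" should answer your question. nbest scores every bigram with the function you pass in, here the likelihood ratio (PMI would be BigramAssocMeasures.pmi instead), and returns the 4 highest-scoring bigrams, i.e. the word pairs that occur together more often than chance would predict in your corpus.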
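To make the two arguments concrete, here is a minimal pure-Python sketch of the idea: score every bigram with a log-likelihood ratio (G^2) built from a 2x2 contingency table, then keep the n best. This is not NLTK's code. The names likelihood_ratio and nbest_bigrams and the toy word list are made up for illustration, and the marginal counts are taken from bigram positions, so the numbers may differ slightly from what NLTK reports.

```python
import math
from collections import Counter


def likelihood_ratio(n_ii, n_ix, n_xi, n_xx):
    """Log-likelihood ratio (G^2) for one bigram (w1, w2).

    n_ii: how often w1 is immediately followed by w2
    n_ix: how often w1 appears as the first word of any bigram
    n_xi: how often w2 appears as the second word of any bigram
    n_xx: total number of bigrams
    """
    # Observed 2x2 contingency table, flattened row by row.
    observed = [
        n_ii,                       # w1 followed by w2
        n_ix - n_ii,                # w1 followed by another word
        n_xi - n_ii,                # another word followed by w2
        n_xx - n_ix - n_xi + n_ii,  # neither w1 first nor w2 second
    ]
    row_totals = [n_ix, n_ix, n_xx - n_ix, n_xx - n_ix]
    col_totals = [n_xi, n_xx - n_xi, n_xi, n_xx - n_xi]
    # Counts we would expect if w1 and w2 were independent.
    expected = [r * c / n_xx for r, c in zip(row_totals, col_totals)]
    # Empty cells contribute nothing (0 * log 0 is taken as 0).
    return 2 * sum(o * math.log(o / e)
                   for o, e in zip(observed, expected) if o > 0)


def nbest_bigrams(words, n):
    """Return the n bigrams with the highest likelihood-ratio score."""
    bigrams = Counter(zip(words, words[1:]))
    first = Counter(words[:-1])   # words in first position of a bigram
    second = Counter(words[1:])   # words in second position of a bigram
    total = sum(bigrams.values())
    scores = {
        bg: likelihood_ratio(count, first[bg[0]], second[bg[1]], total)
        for bg, count in bigrams.items()
    }
    # Highest score first; ties are broken alphabetically.
    ranked = sorted(scores, key=lambda bg: (-scores[bg], bg))
    return ranked[:n]


# Hypothetical toy data, standing in for webtext.words('grail.txt').
words = ["holy", "grail", "a", "b", "holy", "grail", "c", "d", "holy", "grail"]
print(nbest_bigrams(words, 4))
```

So likelihood_ratio is the scoring function and 4 is how many top-ranked bigrams to return. In NLTK itself, bcf.score_ngrams(BigramAssocMeasures.likelihood_ratio) gives the full list of (bigram, score) pairs, which is useful for checking why a particular pair made the top 4.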
Upvotes: 1