Reputation: 3427
Given some text, how can I get the most common n-grams across n=1 to 6? I've seen methods that do this for 3-grams or 2-grams, one n at a time, but is there a way to extract the longest phrase that still makes sense, along with all the shorter ones?
For example, in this text (for demo purposes only):
fri evening commute can be long. some people avoid fri evening commute by choosing off-peak hours. there are much less traffic during off-peak.
The ideal output, n-grams with their counts, would be:
fri evening commute: 2,
off-peak: 2,
rest of the words: 1
Any advice is appreciated. Thanks.
Upvotes: 0
Views: 3681
Reputation:
If you plan to use R, I would suggest the udpipe package; see this vignette: https://cran.r-project.org/web/packages/udpipe/vignettes/udpipe-usecase-postagging-lemmatisation.html
Upvotes: 1
Reputation: 2222
Python
Consider the NLTK library, which offers an ngrams function that you can call once for each value of n.
A rough implementation would be along the lines of the following, where "rough" is the key word:
from collections import Counter
from nltk import ngrams

result = []
sentence = 'fri evening commute can be long. some people avoid fri evening commute by choosing off-peak hours. there are much less traffic during off-peak.'
# Periods are ignored, and hyphenated words such as "off-peak" are treated as phrases
sentence = sentence.replace('.', '').replace('-', ' ')
# Start with the longest possible n-gram and work down to bigrams
for n in range(len(sentence.split()), 1, -1):
    phrases = [' '.join(token) for token in ngrams(sentence.split(), n)]
    if not phrases:
        # The sentence has shrunk below n tokens, so there is nothing to count
        continue
    phrase, freq = Counter(phrases).most_common(1)[0]
    if freq > 1:
        # Store the count (not n) so the report below prints frequencies
        result.append((phrase, freq))
        # Remove the phrase so its words are not counted again in shorter n-grams
        sentence = sentence.replace(phrase, '')
for phrase, freq in result:
    print('%s: %d' % (phrase, freq))
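If you would rather not depend on NLTK, the same idea can be sketched with the standard library only. This is a minimal sketch, not a definitive implementation: the helper names ngram_counts and repeated_phrases are made up for illustration, the tokenizer is a plain whitespace split with the same period and hyphen handling as above, and n-grams are allowed to cross sentence boundaries. It counts every n-gram for n=1 to 6 in one pass, then keeps only the repeated phrases that are not contained in a longer phrase with the same count.

```python
from collections import Counter

def ngram_counts(text, max_n=6):
    """Count every n-gram of length 1..max_n over whitespace tokens."""
    # Same cleaning as in the snippet above: drop periods, split hyphenated words
    tokens = text.replace('.', '').replace('-', ' ').split()
    counts = Counter()
    for n in range(1, max_n + 1):
        # Slide a window of n tokens across the token list
        for i in range(len(tokens) - n + 1):
            counts[' '.join(tokens[i:i + n])] += 1
    return counts

def repeated_phrases(counts):
    """Keep repeated n-grams not contained in a longer n-gram with the same count."""
    repeated = {p: c for p, c in counts.items() if c > 1}
    maximal = {}
    for phrase, count in repeated.items():
        # Pad with spaces so that only whole-word containment matches
        padded = ' %s ' % phrase
        covered = any(
            other != phrase
            and other_count == count
            and padded in (' %s ' % other)
            for other, other_count in repeated.items()
        )
        if not covered:
            maximal[phrase] = count
    return maximal

text = ('fri evening commute can be long. some people avoid fri evening '
        'commute by choosing off-peak hours. there are much less traffic '
        'during off-peak.')
counts = ngram_counts(text)
for phrase, freq in repeated_phrases(counts).items():
    print('%s: %d' % (phrase, freq))
```

The full counts object still holds every n-gram, including the words that occur only once, so counts.most_common() gives you "all the rest" as well. The containment check is quadratic in the number of repeated phrases, which is fine for short texts but would need rethinking for a large corpus.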
As for R
Upvotes: 5