Reputation: 21
From a document I want to generate all the n-grams that contain a certain word.
Example:
document: i am 50 years old, my son is 20 years old
word: years
n: 2
Output:
[(50, years), (years, old), (20, years), (years, old)]
I know we can generate all possible n-grams and keep only the ones containing the word, but I was wondering if there is a more efficient way to do it. I plan to use PySpark to generate them.
Upvotes: 2
Views: 1693
Reputation: 1747
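from nltk.util import ngrams

DOC = 'i am 50 years old, my son is 20 years old'

def ngram_filter(doc, word, n):
    # Whitespace split keeps punctuation attached to tokens (e.g. 'old,')
    tokens = doc.split()
    # ngrams() lazily yields every n-gram as a tuple
    all_ngrams = ngrams(tokens, n)
    # Keep only the n-grams that contain the target word
    filtered_ngrams = [x for x in all_ngrams if word in x]
    return filtered_ngrams

ngram_filter(DOC, 'years', 2)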
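
If building every n-gram first is a concern, one possible alternative is to locate the positions of the word and emit only the windows that cover each position. The following is a minimal sketch under the same whitespace-tokenization assumption; `ngrams_containing` is a hypothetical helper name, not part of NLTK.

```python
def ngrams_containing(tokens, word, n):
    """Return the n-grams (as tuples) that contain `word`, in document order."""
    result = []
    last_start = -1  # start index of the last window emitted, used to avoid duplicates
    for i, tok in enumerate(tokens):
        if tok != word:
            continue
        # Windows covering position i start anywhere from i-n+1 to i,
        # clipped to valid start indices and to windows not yet emitted.
        lo = max(i - n + 1, last_start + 1, 0)
        hi = min(i, len(tokens) - n)
        for start in range(lo, hi + 1):
            result.append(tuple(tokens[start:start + n]))
            last_start = start
    return result

DOC = 'i am 50 years old, my son is 20 years old'
print(ngrams_containing(DOC.split(), 'years', 2))
# [('50', 'years'), ('years', 'old,'), ('20', 'years'), ('years', 'old')]
```

This still scans the tokens once, but it only slices around matches, so it may help when the word is rare and n is large. Since it is a plain function over a token list, it could in principle be applied per document in PySpark, for example inside an RDD `flatMap`.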
Upvotes: 2