ace allen

Reputation: 21

Find all the n-grams that contain a certain word efficiently

From a document I want to generate all the n-grams that contain a certain word.

Example:

document: i am 50 years old, my son is 20 years old
word: years
n: 2

Output:

[(50, years), (years, old), (20, years), (years, old)]

I know we can generate all the possible n-grams and keep only the ones that contain the word, but I was wondering whether there is a more efficient way to do it. I plan to use PySpark to generate them.

Upvotes: 2

Views: 1693

Answers (1)

Stefanus

Reputation: 1747

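import re

from nltk.util import ngrams

DOC = 'i am 50 years old, my son is 20 years old'


def ngram_filter(doc, word, n):
    # Tokenize on word characters so that 'old,' becomes 'old'
    tokens = re.findall(r'\w+', doc)
    # Build every n-gram, then keep only those that contain the word
    return [x for x in ngrams(tokens, n) if word in x]


print(ngram_filter(DOC, 'years', 2))
# [('50', 'years'), ('years', 'old'), ('20', 'years'), ('years', 'old')]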
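
If the word is rare, one possible way to avoid building every n-gram is to locate each occurrence of the word and emit only the windows (at most n per occurrence) that cover it. The following is a minimal sketch under assumptions: the function name ngrams_containing is made up for illustration, and the \w+ tokenization is a guess at what the question intends, since the expected output drops the comma after "old".

```python
import re


def ngrams_containing(doc, word, n):
    """Return the n-grams of doc that contain word, in document order."""
    # Assumed tokenization: runs of word characters, punctuation dropped
    tokens = re.findall(r"\w+", doc)
    result = []
    last_start = -1  # start index of the last window already emitted
    for i, tok in enumerate(tokens):
        if tok != word:
            continue
        # Windows covering position i start between i - n + 1 and i.
        # Skip starts already emitted for an earlier, nearby occurrence.
        lo = max(i - n + 1, 0, last_start + 1)
        hi = min(i, len(tokens) - n)
        for start in range(lo, hi + 1):
            result.append(tuple(tokens[start:start + n]))
        last_start = max(last_start, hi)
    return result


DOC = 'i am 50 years old, my son is 20 years old'
print(ngrams_containing(DOC, 'years', 2))
```

The tokens are still scanned once, but only the windows around each match are materialized, so the saving depends on how often the word occurs. Because it is a plain function of one document, it could presumably be applied per document in PySpark, for example inside an RDD flatMap, though that is not shown here.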

Upvotes: 2
