Find all the n-grams that contain a certain word efficiently

Question

From a document, I want to generate all the n-grams that contain a given word.

Example:

document: i am 50 years old, my son is 20 years old
word: years
n: 2

Output:

[(50, years), (years, old), (20, years), (years, old)]

I know we can generate all the possible n-grams and then keep only the ones that contain the word, but I was wondering whether there is a more efficient way to do it. I was planning on using PySpark to generate them.
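One possible way to avoid building every n-gram is to locate the positions of the target word first and only slice the windows that overlap those positions. The following is a minimal sketch in plain Python, not a definitive implementation: the helper names `tokenize` and `ngrams_containing` are hypothetical, and the tokenizer assumes that splitting on runs of word characters is good enough for the document.

```python
import re


def tokenize(document):
    # Assumption: lowercase word-character runs are adequate tokens,
    # so "old," becomes "old".
    return re.findall(r"\w+", document.lower())


def ngrams_containing(tokens, word, n):
    """Return the n-grams (as tuples) that contain `word`, in document order.

    Only windows overlapping an occurrence of `word` are sliced, so the
    work is proportional to (number of occurrences * n) after one scan.
    """
    result = []
    next_start = 0                 # first window start not emitted yet
    last_start = len(tokens) - n   # largest valid window start
    for i, tok in enumerate(tokens):
        if tok != word:
            continue
        # Windows containing position i start in [i - n + 1, i].
        lo = max(next_start, i - n + 1)
        hi = min(i, last_start)
        for start in range(lo, hi + 1):
            result.append(tuple(tokens[start:start + n]))
        # Skip windows already emitted for a nearby earlier occurrence.
        next_start = max(next_start, hi + 1)
    return result


tokens = tokenize("i am 50 years old, my son is 20 years old")
print(ngrams_containing(tokens, "years", 2))
# [('50', 'years'), ('years', 'old'), ('20', 'years'), ('years', 'old')]
```

Because the function works on one token list at a time, it could presumably be applied per document in PySpark, for instance inside an RDD `flatMap`, though that part is not shown or tested here.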

Answers (1)
