hawk

Reputation: 130

N-grams based on POS tags: spaCy

I have a list of 20 rules for extracting trigram chunks from a sentence with spaCy.

Each chunk corresponds to a trigram of POS tags.

Example Input:

"Education of children was our revenue earning secondary business."

Desired Output:

["Education of children","earning secondary business"]

I have already tried the spaCy Matcher, but I need something more optimised than running a for loop over every sentence, as the dataset is very large.
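
For reference, the POS tags behind those chunks can be inspected directly (a quick sketch using en_core_web_sm; the exact tags depend on the model):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Education of children was our revenue earning secondary business.")

# Print each token with its coarse POS tag, e.g. "Education NOUN", "of ADP", ...
for token in doc:
    print(token.text, token.pos_)

With this model, "Education of children" comes out as NOUN ADP NOUN and "earning secondary business" as VERB ADJ NOUN.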

Upvotes: 0

Views: 1977

Answers (1)

tomjn

Reputation: 5389

I think you are looking for rule-based matching. Your code will look something like:

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")

list_of_rules = [
    ["VERB", "ADJ", "NOUN"],
    ["NOUN", "VERB", "ADV"],
    ["NOUN", "ADP", "NOUN"],
    # more rules here...
]

rules = [[{"POS": i} for i in j] for j in list_of_rules]

matcher = Matcher(nlp.vocab)
matcher.add("rules", None, *rules)

doc = nlp("Education of children was our revenue earning secondary business.")
matches = matcher(doc)
print([doc[start:end].text for _, start, end in matches])

which will print

['Education of children', 'earning secondary business']
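
Since the dataset is large, you can also stream the texts through nlp.pipe instead of calling nlp() on each sentence in a plain Python loop, and run the matcher on each resulting Doc. A sketch, where texts and batch_size are placeholders for your own data and settings:

texts = [
    "Education of children was our revenue earning secondary business.",
    # ... the rest of your dataset
]

results = []
# nlp.pipe batches sentences through the pipeline, which is much faster
# than calling nlp() once per sentence.
for doc in nlp.pipe(texts, batch_size=1000):
    matches = matcher(doc)
    results.append([doc[start:end].text for _, start, end in matches])

print(results[0])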

Upvotes: 1
