scicos88

Reputation: 53

How to match repeating patterns in spacy?

My question is similar to the one asked in this post: How to define a repeating pattern consisting of multiple tokens in spacy? The difference in my case is that my pattern is defined by POS and dependency tags. As a consequence, I don't think I can easily use regex to solve my problem (as suggested in the accepted answer of the linked post).

For example, let's assume we analyze the following sentence:

"She told me that her dog was big, black and strong."

The following code would allow me to match the list of adjectives at the end of the sentence:

import spacy # I am using spacy 2
from spacy.matcher import Matcher
nlp = spacy.load('en_core_web_sm')

# Create doc object from text
doc = nlp(u"She told me that her dog was big, black and strong.")

# Set up pattern matching
matcher = Matcher(nlp.vocab)
pattern = [{"POS": "ADJ"}, {"IS_PUNCT": True}, {"POS": "ADJ"}, {"POS": "CCONJ"}, {"POS": "ADJ"}]
matcher.add("AdjList", [pattern])


matches = matcher(doc)

Running this code matches "big, black and strong". However, this pattern does not find the list of adjectives in sentences such as "She told me that her dog was big and black" or "She told me that her dog was big, black, strong and playful".

How would I have to define a (single) pattern for spacy's matcher in order to find such a list with any number of adjectives? Put differently, I am looking for the correct syntax for a pattern where the part {"POS": "ADJ"}, {"IS_PUNCT": True} can be repeated arbitrarily often before the list concludes with the pattern {"POS": "ADJ"}, {"POS": "CCONJ"}, {"POS": "ADJ"}.

Thanks for any hints.

Upvotes: 2

Views: 983

Answers (1)

polm23

Reputation: 15593

The issue, and the solution, are not fundamentally different from those in the linked question: the Matcher has no facility for repeating a multi-token sub-pattern within a match. You can use a for loop to build multiple patterns, one per repeat count, to capture what you want.

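patterns = []
# ii = number of leading "ADJ, punctuation" pairs; 0 covers "big and black"
for ii in range(0, 5):
    pattern = [{"POS": "ADJ"}, {"IS_PUNCT": True}] * ii
    pattern += [{"POS": "ADJ"}, {"POS": "CCONJ"}, {"POS": "ADJ"}]
    patterns.append(pattern)
# Register all variants under one key (matcher as defined in the question)
matcher.add("AdjList", patterns)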
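Because several of these patterns can match overlapping spans of the same list (for example, "black and strong" inside "big, black and strong"), it may help to keep only the longest match. The following is a minimal sketch, not a definitive implementation: `build_adj_list_patterns` and `keep_longest` are hypothetical helper names, and the match tuples are assumed to have the `(match_id, start, end)` shape that `matcher(doc)` returns.

```python
def build_adj_list_patterns(max_repeats=4):
    """Hypothetical helper: one pattern per repeat count, 0..max_repeats."""
    head = [{"POS": "ADJ"}, {"IS_PUNCT": True}]                # e.g. "big ,"
    tail = [{"POS": "ADJ"}, {"POS": "CCONJ"}, {"POS": "ADJ"}]  # e.g. "black and strong"
    return [head * n + tail for n in range(max_repeats + 1)]


def keep_longest(matches):
    """Hypothetical helper: drop matches that overlap a longer kept match."""
    kept = []
    # Visit the longest matches first; ties go to the earliest start.
    for match_id, start, end in sorted(matches, key=lambda m: (m[1] - m[2], m[1])):
        if all(end <= s or start >= e for _, s, e in kept):
            kept.append((match_id, start, end))
    return sorted(kept, key=lambda m: m[1])


patterns = build_adj_list_patterns(max_repeats=2)
```

The resulting list can be passed to `matcher.add("AdjList", patterns)`, and the output of `matcher(doc)` can then be run through `keep_longest`. spaCy also provides `spacy.util.filter_spans`, which does a similar job on `Span` objects.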

Alternatively, you could use the dependency matcher. In your example sentence the parse is not that clean, but for a sentence like "It was a big, brown, playful dog", the adjectives all have dependency arcs directly connecting them to the noun.
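To illustrate the idea of following dependency arcs, here is a rough sketch that collects the adjectives attached directly to a noun. It walks the arcs by hand rather than using spaCy's DependencyMatcher, and it runs on a hand-built stand-in for the parse: the `Tok` class and the `adjective_modifiers` function are hypothetical, and the `amod` labels are assumed, so a real parse may differ.

```python
from dataclasses import dataclass, field


@dataclass
class Tok:
    """Hypothetical stand-in for a spaCy Token (text, pos_, dep_, children)."""
    text: str
    pos_: str
    dep_: str
    children: list = field(default_factory=list)


def adjective_modifiers(noun):
    """Collect adjectives attached directly to `noun` by an `amod` arc."""
    return [t.text for t in noun.children if t.pos_ == "ADJ" and t.dep_ == "amod"]


# Hand-built parse of "a big, brown, playful dog"; labels are assumed.
dog = Tok("dog", "NOUN", "attr", children=[
    Tok("a", "DET", "det"),
    Tok("big", "ADJ", "amod"),
    Tok(",", "PUNCT", "punct"),
    Tok("brown", "ADJ", "amod"),
    Tok(",", "PUNCT", "punct"),
    Tok("playful", "ADJ", "amod"),
])
adjectives = adjective_modifiers(dog)
```

Since the function only reads `pos_`, `dep_` and `children`, the same logic should apply to real spaCy tokens, with the list length no longer mattering because every adjective hangs off the noun.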

As a separate note, you are not handling sentences with the serial comma, i.e. a comma before the conjunction, as in "big, black, and strong".
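One way to allow for the serial comma is to make a punctuation token before the conjunction optional with the Matcher's `"OP": "?"` operator. This is only a sketch of the closing part of the pattern; `list_tail` is a hypothetical helper name.

```python
def list_tail(allow_serial_comma=True):
    """Hypothetical helper: closing part of the list, ADJ [,] CCONJ ADJ."""
    tail = [{"POS": "ADJ"}]
    if allow_serial_comma:
        # "OP": "?" makes this token optional (zero or one occurrence).
        tail.append({"IS_PUNCT": True, "OP": "?"})
    tail += [{"POS": "CCONJ"}, {"POS": "ADJ"}]
    return tail


tail = list_tail()
```

With this tail, the same pattern should cover both "black and strong" and "black, and strong".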

Upvotes: 1
