Return match if one match from each of two patterns found using python spaCy PhraseMatcher

Question

I have multiple fragments of text stored in a list that, let's say, looks like this:

text = ['mary had a little lamb', 'julie had a little goat',
'julie enjoys eating pizza', 'mary went to the market', 
'in the market there was a lamb', 'my goat likes to drink coffee', 
'tara throws a ball for her goat', 'a goat and a kangaroo can often be friends',
'tara and mary like to drink beer']

I want to return a match only when a text fragment contains BOTH the name of an animal and a girl's name. So for the text above, I want it to return only these fragments:

['mary had a little lamb', 'julie had a little goat',
'tara throws a ball for her goat']

I get the feeling that I should be able to do this in spaCy by defining multiple patterns like this:

import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")
phrase_matcher = PhraseMatcher(nlp.vocab)

girls_names = ['mary', 'tara', 'julie']
animals = ['lamb', 'goat']

phrase_matcher.add('GIRLS_NAMES', None, *girls_names)
phrase_matcher.add('ANIMALS', None, *animals)

I have spaCy matching keywords in general (code below), but I have no idea how to make it flag a fragment only when at least one word from each pattern is matched, or even how to have it print which pattern is being matched.

for fragment in text:
    doc = nlp(fragment)
    matches = phrase_matcher(doc)
    print('MATCHED KEYWORDS:')
    for match_id, start, end in matches:
        span = doc[start:end]
        print(span.text)
    print('FRAGMENT')
    print(fragment)

Output:

MATCHED KEYWORDS:
mary
lamb
FRAGMENT
mary had a little lamb
MATCHED KEYWORDS:
julie
goat
FRAGMENT
julie had a little goat
MATCHED KEYWORDS:
julie
FRAGMENT
julie enjoys eating pizza
MATCHED KEYWORDS:
mary
FRAGMENT
mary went to the market
MATCHED KEYWORDS:
lamb
FRAGMENT
in the market there was a lamb
MATCHED KEYWORDS:
goat
FRAGMENT
my goat likes to drink coffee
MATCHED KEYWORDS:
tara
goat
FRAGMENT
tara throws a ball for her goat
MATCHED KEYWORDS:
goat
kangaroo
FRAGMENT
a goat and a kangaroo can often be friends
MATCHED KEYWORDS:
tara
mary
FRAGMENT
tara and mary like to drink beer

thorntonc · Accepted Answer

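Use the match_id. Each match is a (match_id, start, end) tuple, and the match_id is the hash of the rule name that produced it. Collect the rule names matched in each fragment and keep the fragment only if both GIRLS_NAMES and ANIMALS are among them.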

import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")
phrase_matcher = PhraseMatcher(nlp.vocab)

girls_names = [nlp.make_doc(text) for text in ['mary', 'tara', 'julie']]
animals = [nlp.make_doc(text) for text in ['lamb', 'goat']]

phrase_matcher.add('GIRLS_NAMES', None, *girls_names)
phrase_matcher.add('ANIMALS', None, *animals)

text = ['mary had a little lamb', 'julie had a little goat',
'julie enjoys eating pizza', 'mary went to the market',
'in the market there was a lamb', 'my goat likes to drink coffee',
'tara throws a ball for her goat', 'a goat and a kangaroo can often be friends',
'tara and mary like to drink beer']

for fragment in text:
    doc = nlp(fragment)
    matches = phrase_matcher(doc)
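    # match[0] is the match_id (a hash); nlp.vocab.strings maps it back to the rule name
    rule_ids = {nlp.vocab.strings[match[0]] for match in matches}
    # keep the fragment only if both rules matched at least once
    if {'GIRLS_NAMES', 'ANIMALS'}.issubset(rule_ids):
        print(fragment)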

Output:

mary had a little lamb
julie had a little goat
tara throws a ball for her goat
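
The question also asks how to print which pattern each keyword came from. A minimal sketch of one way to do that is below. It rests on a few assumptions: spaCy v3, where PhraseMatcher.add takes a list of Doc patterns; a blank English pipeline (spacy.blank("en")) in place of en_core_web_sm, since only tokenization is needed; and a helper called keywords_by_rule and a rules dict that are hypothetical names made up for this example.

```python
import spacy
from spacy.matcher import PhraseMatcher

# Assumption: a blank pipeline is enough, since PhraseMatcher only needs tokens.
nlp = spacy.blank("en")
phrase_matcher = PhraseMatcher(nlp.vocab)

# Same keyword lists as above, one rule name per list.
rules = {
    "GIRLS_NAMES": ["mary", "tara", "julie"],
    "ANIMALS": ["lamb", "goat"],
}
for rule_name, keywords in rules.items():
    # spaCy v3 form: the patterns are passed as a list of Doc objects
    phrase_matcher.add(rule_name, [nlp.make_doc(k) for k in keywords])


def keywords_by_rule(fragment):
    """Hypothetical helper: return {rule name: [matched keywords]} for one fragment."""
    doc = nlp.make_doc(fragment)
    found = {}
    for match_id, start, end in phrase_matcher(doc):
        rule_name = nlp.vocab.strings[match_id]  # hash -> rule name
        found.setdefault(rule_name, []).append(doc[start:end].text)
    return found


for fragment in ["mary had a little lamb", "julie enjoys eating pizza"]:
    found = keywords_by_rule(fragment)
    if set(rules) <= set(found):  # every rule matched at least once
        print(fragment, found)
```

Keeping the matched keywords grouped by rule name means the same dictionary serves both purposes: its keys answer "did every pattern match?" and its values show which words triggered each pattern.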
