GratefulGuest

Reputation: 837

What does the spaCy PhraseMatcher match on?

What attributes of the tokens within a doc object does a PhraseMatcher check for and require to find a match?

For example, if I just create a doc using

doc1 = nlp('lead')

then the 'lead' token is an ADJ, whereas if I have a doc such as

doc2 = nlp('lead plate')

then the 'lead' token is a NOUN.

If I add doc1 to a PhraseMatcher instance, should I expect that this matcher finds a match in doc2?

Similarly, what if I have, e.g.

doc1 = nlp('Lead')
doc2 = nlp('lead')

I.e., is the matching case-sensitive?

This is not to mention token attributes like dependency, etc. I didn't find the documentation to be clear on this.

Upvotes: 1

Views: 2907

Answers (1)

Ines Montani

Reputation: 7105

The PhraseMatcher will match on the ORTH value, i.e. the exact text. This lets it match large terminology lists and exact occurrences of strings, without having to worry about spaCy's tokenization. For more background on this, why the PhraseMatcher can't work on other attributes, and possible solutions for case-insensitivity, see this discussion on the issue tracker.
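To make the ORTH behaviour concrete, here is a minimal sketch. It assumes spaCy v3 (where `matcher.add` takes a list of pattern docs) and uses a blank English pipeline, since exact-text matching needs only the tokenizer. The helper `matched_texts` is a hypothetical name introduced for illustration. The last part relies on the `attr` argument that, as far as I know, was added to the PhraseMatcher in later releases (v2.1+), so it may not apply to the version this answer was written for.

```python
import spacy
from spacy.matcher import PhraseMatcher

# Tokenizer-only pipeline: ORTH matching does not need a tagger or parser.
nlp = spacy.blank("en")

# Default PhraseMatcher: compares the verbatim token text (ORTH).
matcher = PhraseMatcher(nlp.vocab)
matcher.add("LEAD", [nlp.make_doc("lead")])


def matched_texts(phrase_matcher, text):
    """Hypothetical helper: return the text of every matched span."""
    doc = nlp.make_doc(text)
    return [doc[start:end].text for _, start, end in phrase_matcher(doc)]


# Same text matches, regardless of what part of speech "lead" would get.
exact = matched_texts(matcher, "a lead plate")

# Different casing is a different ORTH value, so nothing matches.
cased = matched_texts(matcher, "Lead plate")

# Assumption: spaCy v2.1+ where PhraseMatcher accepts attr="LOWER".
lower_matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
lower_matcher.add("LEAD", [nlp.make_doc("lead")])
cased_lower = matched_texts(lower_matcher, "Lead plate")

print(exact, cased, cased_lower)
```

Because only the token text is compared, the POS tag or dependency label assigned to 'lead' in either doc plays no role in whether the pattern matches.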

If you want to match based on token attributes, you'll probably want to use the rule-based Matcher instead:

pattern = [{"LOWER": "lead", "POS": "ADJ"}]
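As a rough sketch of how that pattern could be used, the snippet below assumes spaCy v3 (`matcher.add` with a list of patterns, and `Doc` accepting a `pos` argument). To keep it runnable without downloading a trained model, the POS tags are assigned by hand; in real use they would come from the tagger of a loaded pipeline. The match key "LEAD_ADJ" and the example sentences are made up for illustration.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = spacy.blank("en")

# One-token pattern: lowercase form "lead" AND coarse POS tag ADJ.
pattern = [{"LOWER": "lead", "POS": "ADJ"}]
matcher = Matcher(nlp.vocab)
matcher.add("LEAD_ADJ", [pattern])

# Assumption: hand-assigned tags stand in for a trained tagger's output.
adj_doc = Doc(nlp.vocab, words=["Lead", "developer"], pos=["ADJ", "NOUN"])
noun_doc = Doc(nlp.vocab, words=["lead", "plate"], pos=["NOUN", "NOUN"])

adj_hits = [adj_doc[start:end].text for _, start, end in matcher(adj_doc)]
noun_hits = [noun_doc[start:end].text for _, start, end in matcher(noun_doc)]

print(adj_hits, noun_hits)
```

Here "LOWER" makes the match case-insensitive, while "POS" restricts it to the adjective reading, so the noun in 'lead plate' is skipped.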

There's also this newly added example in the docs that shows how to use the Matcher with token match patterns and regular expressions (or binary flags more generally). This can be useful to add your own custom token descriptions like different spellings.
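For the "different spellings" case, one possible sketch uses the Matcher's `REGEX` operator rather than a binary flag. This assumes spaCy v3 (the operator exists from v2.1 onwards, to my knowledge); the colour/color example and the key "COLOUR" are invented here and are not taken from the linked docs example.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")

# Match British and American spellings, case-insensitively via LOWER.
pattern = [{"LOWER": {"REGEX": "^colou?r$"}}]
matcher = Matcher(nlp.vocab)
matcher.add("COLOUR", [pattern])

doc = nlp.make_doc("The colour and the Color differ")
spellings = [doc[start:end].text for _, start, end in matcher(doc)]

print(spellings)
```

The regular expression is applied per token, so it cannot span several tokens; multi-token variation still needs one pattern entry per token.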

You might also want to check out spacy-lookup, a community plugin that uses the FlashText module and provides an alternative to the built-in PhraseMatcher.

Upvotes: 3
