Reputation: 837
What attributes of the tokens within a doc object does a PhraseMatcher check for and require to find a match?
For example, if I just create a doc using
doc1 = nlp('lead')
then the 'lead' token is an ADJ, whereas if I have a doc such as
doc2 = nlp('lead plate')
then the 'lead' token is a NOUN.
If I add doc1 to a PhraseMatcher instance, should I expect that this matcher finds a match in doc2?
Similarly, what if I have, e.g.
doc1 = nlp('Lead')
doc2 = nlp('lead')
I.e., is the matching case sensitive?
And that's not to mention other token attributes such as the dependency label. I didn't find the documentation clear on this.
Upvotes: 1
Views: 2907
Reputation: 7105
The PhraseMatcher will match on the ORTH value, i.e. the exact text. This lets it match large terminology lists and exact occurrences of strings, without having to worry about spaCy's tokenization. For more background on this, why the PhraseMatcher can't work on other attributes, and possible solutions for case-insensitivity, see this discussion on the issue tracker.
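To make that concrete for the examples in the question: since only the verbatim text is compared, 'lead' should match inside 'lead plate' whatever its part-of-speech tag is, while 'Lead' should not match 'lead'. Here is a minimal sketch, assuming a recent spaCy release where patterns are passed to add() as a list, and using a blank English pipeline (tokenizer only) to underline that no tagger is involved. The matched_texts helper and the match key "LEAD" are just illustrative names. The attr="LOWER" variant relies on the attr argument, which as far as I know was added to the PhraseMatcher in later versions (v2.1 onwards), so treat that part as version-dependent.

```python
import spacy
from spacy.matcher import PhraseMatcher

# A blank pipeline only tokenizes: no POS tags or dependencies are predicted.
nlp = spacy.blank("en")

# Default behaviour: compare the verbatim token text (ORTH).
exact_matcher = PhraseMatcher(nlp.vocab)
exact_matcher.add("LEAD", [nlp("lead")])

# Assumption: a spaCy version whose PhraseMatcher accepts `attr`.
lower_matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
lower_matcher.add("LEAD", [nlp("lead")])


def matched_texts(matcher, doc):
    """Return the text of every span the matcher finds in doc (helper for this sketch)."""
    return [doc[start:end].text for _, start, end in matcher(doc)]


in_phrase = matched_texts(exact_matcher, nlp("lead plate"))
capitalized = matched_texts(exact_matcher, nlp("Lead plate"))
case_insensitive = matched_texts(lower_matcher, nlp("Lead plate"))

print(in_phrase)
print(capitalized)
print(case_insensitive)
```

With the default settings the first call should find 'lead' and the second should find nothing; only the LOWER-based matcher picks up the capitalized 'Lead'.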
If you want to match based on token attributes, you'll probably want to use the rule-based Matcher instead:
pattern = [{"LOWER": "lead", "POS": "ADJ"}]
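A rough sketch of how that pattern could be used end to end. To keep it independent of any trained model, the two docs below are built by hand with made-up POS tags standing in for what a tagger might assign, which assumes a spaCy v3-style Doc constructor that accepts pos. With a real pipeline you would just call nlp(text) on a model that includes a tagger. The key "LEAD_ADJ" and the example words are only illustrative.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = spacy.blank("en")

matcher = Matcher(nlp.vocab)
pattern = [{"LOWER": "lead", "POS": "ADJ"}]
matcher.add("LEAD_ADJ", [pattern])

# Hand-annotated docs: the POS tags are assumptions, not tagger output.
adj_doc = Doc(nlp.vocab, words=["Lead", "singer"], pos=["ADJ", "NOUN"])
noun_doc = Doc(nlp.vocab, words=["lead", "plate"], pos=["NOUN", "NOUN"])

# LOWER makes the text comparison case-insensitive; POS restricts it to adjectives.
adj_hits = [adj_doc[start:end].text for _, start, end in matcher(adj_doc)]
noun_hits = [noun_doc[start:end].text for _, start, end in matcher(noun_doc)]

print(adj_hits)
print(noun_hits)
```

Only the doc where 'lead' is tagged ADJ should produce a match, regardless of capitalization.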
There's also this newly added example in the docs that shows how to use the Matcher with token match patterns and regular expressions (or binary flags more generally). This can be useful for adding your own custom token descriptions, such as different spellings.
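For the spelling-variant case, one option in newer spaCy versions is the REGEX operator inside a token pattern, which avoids registering a custom flag. This is only a sketch and not the example from the docs: the colour/color pattern and the key "COLOUR" are my own illustration, and it assumes a spaCy version that supports REGEX in Matcher patterns.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")

matcher = Matcher(nlp.vocab)
# One token whose text is "color" or "colour", with either capitalization.
spelling_pattern = [{"TEXT": {"REGEX": r"^[Cc]olou?r$"}}]
matcher.add("COLOUR", [spelling_pattern])

doc = nlp("Colour and color are the same word, but colorful is not.")
spellings = [doc[start:end].text for _, start, end in matcher(doc)]

print(spellings)
```

The anchors ^ and $ keep the pattern from matching longer tokens such as 'colorful'.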
You might also want to check out spacy-lookup, a community plugin that uses the FlashText module and provides an alternative to the built-in PhraseMatcher.
Upvotes: 3