Reputation: 1480
The rule-based pattern matching in spaCy returns a match ID along with the start and end token indices of the matched span, but I don't see anything in the documentation that says how to determine which tokens in that span matched which parts of the pattern.
In regex, I can put parentheses around parts of a pattern to capture them as groups and pull them out of the match. Is something like this possible with spaCy?
For example, I have this text (from Dracula):
They wore high boots, with their trousers tucked into them, and had long black hair and heavy black moustaches.
And I've defined an experiment:
import spacy
from spacy.matcher import Matcher
def test_match(text, patterns):
    nlp = spacy.load('en_core_web_sm')
    matcher = Matcher(nlp.vocab)
    # spaCy v2 signature; in v3 this is matcher.add('Boots', [patterns])
    matcher.add('Boots', None, patterns)
    doc = nlp(text)
    matches = matcher(doc)
    for match in matches:
        # start and end are token indices into doc, not character offsets
        match_id, start, end = match
        string_id = nlp.vocab.strings[match_id]
        span = doc[start:end]
        print(match, span.text)
text_a = "They wore high boots, with their trousers tucked into them, " \
"and had long black hair and heavy black moustaches."
patterns = [
    {'POS': 'PRON'},
    {'TAG': 'VBD'},
    {'POS': 'ADJ'},
    {'TAG': 'NNS'}
]
test_match(text_a, patterns)
This outputs:
(18231591219755621867, 0, 4) They wore high boots
For a simple pattern like this, with four tokens in a row, I can assume that token 0 is the pronoun, token 1 is the past-tense verb, etc. For patterns with quantifiers (the 'OP' key), however, the mapping becomes ambiguous. Is it possible to have spaCy tell me which tokens actually matched which components of the pattern?
For example, take this modification of the experiment above, with two '*' quantifiers in the pattern and a new version of the text missing the adjective "high":
text_b = "They wore boots, with their trousers tucked into them, " \
"and had long black hair and heavy black moustaches."
patterns = [
    {'POS': 'PRON'},
    {'TAG': 'VBD'},
    {'POS': 'ADJ', 'OP': '*'},
    {'TAG': 'NNS', 'OP': '*'}
]
test_match(text_a, patterns)
print()
test_match(text_b, patterns)
Which outputs:
(18231591219755621867, 0, 2) They wore
(18231591219755621867, 0, 3) They wore high
(18231591219755621867, 0, 4) They wore high boots
(18231591219755621867, 0, 2) They wore
(18231591219755621867, 0, 3) They wore boots
In both output cases, it is unclear which of the final tokens is an adjective and which is a plural noun. I suppose I can loop over the tokens in the span, then manually match against the search parts of the pattern, but that's decidedly repetitive. Since I assume spaCy has to find them to match them, can't it just tell me which is which?
Upvotes: 1
Views: 1738
Reputation: 71
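Since spaCy v3.0.6, it is possible to get the match alignment information as part of the match tuple (api doc link). Pass with_alignments=True when calling the matcher; each match then carries a list with one entry per token in the matched span, giving the index of the pattern component that the token matched.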
matches = matcher(doc, with_alignments=True)
In your example, it generates the following output:
(1618900948208871284, 0, 2, [0, 1]) They wore
(1618900948208871284, 0, 3, [0, 1, 2]) They wore high
(1618900948208871284, 0, 4, [0, 1, 2, 3]) They wore high boots
(1618900948208871284, 0, 2, [0, 1]) They wore
(1618900948208871284, 0, 3, [0, 1, 3]) They wore boots
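To turn an alignment list into something closer to regex groups, one option is to bucket the token indices by pattern index. The sketch below is only an illustration: `tokens_by_pattern_index` is a hypothetical helper name, not part of spaCy, and it works on match tuples copied from the output above, so it runs without loading a model.

```python
from collections import defaultdict


def tokens_by_pattern_index(start, alignments):
    """Map each pattern index to the doc-level token indices it matched.

    Hypothetical helper, not a spaCy API. `start` is the match start
    (a token index) and `alignments` is the fourth element of a match
    tuple returned by matcher(doc, with_alignments=True).
    """
    groups = defaultdict(list)
    for offset, pattern_index in enumerate(alignments):
        # The i-th alignment entry describes the token at doc[start + i].
        groups[pattern_index].append(start + offset)
    return dict(groups)


# Match tuples copied from the output above: (match_id, start, end, alignments)
match_a = (1618900948208871284, 0, 4, [0, 1, 2, 3])  # "They wore high boots"
match_b = (1618900948208871284, 0, 3, [0, 1, 3])     # "They wore boots"

for match_id, start, end, alignments in (match_a, match_b):
    print(tokens_by_pattern_index(start, alignments))
# {0: [0], 1: [1], 2: [2], 3: [3]}
# {0: [0], 1: [1], 3: [2]}
```

In the second result, pattern index 2 (the ADJ component) is absent and pattern index 3 (the NNS component) maps to token 2, so "boots" is identified as the plural noun. With a real doc, doc[i] for each collected index gives the corresponding token.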
Upvotes: 7