Reputation: 137
I can't figure out where my pattern goes wrong to produce this result.
The sentence I want to find: "#1 – January 31, 2015", and any date that follows this format.
The pattern: pattern1 = [{'ORTH':'#'},{'is_digital':True},{'is_space':True},{'ORTH':'-'},{'is_space':True},{'is_alpha':True},{'is_space':True},{'is_digital':True},{'is_punct':True},{'is_space':True},{'is_digital':True}]
The print code: print("Matches1:", [doc[start:end].text for match_id, start, end in matches1])
The result: ['#', '#', '#']
Expected result: ['#1 – January 31, 2015','#5 – March 15, 2017','#177 – November 22, 2019']
Upvotes: 0
Views: 152
Reputation: 2139
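spaCy's Matcher operates on tokens, and single spaces in the sentence do not produce tokens, so the {'is_space': True} entries in your pattern have nothing to match. Also, several characters resemble hyphens (en dashes, em dashes, minus signs, and so on), so take care that the character in the pattern is the same one that appears in the text. Note too that the digit attribute is IS_DIGIT; is_digital is not a Matcher attribute. The following code works: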
import spacy
from spacy.matcher import Matcher
nlp = spacy.load('en_core_web_lg')
# One dict per token: no entries for spaces, and the dash is an en dash ('–'), not '-'
pattern1 = [{'ORTH': '#'}, {'IS_DIGIT': True}, {'ORTH': '–'}, {'IS_ALPHA': True}, {'IS_DIGIT': True}, {'IS_PUNCT': True}, {'IS_DIGIT': True}]
doc = nlp("#1 – January 31, 2015")
matcher = Matcher(nlp.vocab)
# spaCy v2 signature; in spaCy v3 this call is matcher.add("p1", [pattern1])
matcher.add("p1", None, pattern1)
matches1 = matcher(doc)
print("Matches1:", [doc[start:end].text for match_id, start, end in matches1])
# Matches1: ['#1 – January 31, 2015']
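The point about look-alike dash characters can be checked without spaCy. The following is only a small sketch using Python's standard unicodedata module; the helper name describe and the list of candidate characters are my own choices for illustration. It prints the code point and Unicode name of each character, and then inspects the dash in the example sentence as it appears in the question text.

```python
import unicodedata

# Characters that look alike on screen but are different code points.
# (This list is illustrative, not exhaustive.)
candidates = ["-", "\u2013", "\u2014", "\u2212"]


def describe(ch):
    """Return the code point and official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


for ch in candidates:
    print(repr(ch), describe(ch))

# The example sentence, with its dash written as an escape so the
# character cannot be silently changed by an editor or a copy-paste.
sentence = "#1 \u2013 January 31, 2015"
dash = sentence[3]  # index 3: '#', '1', ' ', then the dash
print(describe(dash))
```

If the last line reports anything other than HYPHEN-MINUS, an {'ORTH': '-'} entry will not match that token, and the pattern needs the exact character from the text.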
Upvotes: 1