Reputation: 3433
I have a bunch of regex in this way: (for simplicity the regex patters are very easy, the real case the regex are very long and barely incomprehensible since they are created automatically from other tool) I want to create spans in a doc based on those regex. This is the code:
import spacy
from spacy.tokens import Doc, Span, Token
import re
rx1 = ["blue","blue print"]
text = " this is blue but there is a blue print. The light is red and the heat is in the infra red."
my_regexes = {'blue':["blue","blue print"],
'red': ["red", "infra red"] }
nlp = spacy.blank("en")
doc = nlp(text)
print(doc.text)
for name, rxs in my_regexes.items():
doc.spans[name] = []
for rx in rxs:
for i, match in enumerate(re.finditer(rx, doc.text)):
start, end = match.span()
span = doc.char_span(start, end, alignment_mode="expand")
# This is a Span object or None if match doesn't map to valid token sequence
span_to_add = Span(doc, span.start, span.end, label=name +str(i))
doc.spans[name].append(span_to_add)
if span is not None:
print("Found match:", name, start, end, span.text )
It works. Now I want to filter the spans in a way that when a series of tokens (for instance "infra red") contain another span ("red") only the longest one is kept. I saw this: How to avoid double-extracting of overlapping patterns in SpaCy with Matcher?
but that looks to be for a matcher, and I can not make it work in my case. Since I would like to eliminate the token Span out of the document.
Any idea?
Upvotes: 0
Views: 464
Reputation: 11474
spacy.util.filter_spans
will do this. The answer is the same as the linked question, where matcher results are converted to spans in order to filter them with this function.
docs.spans[name] = spacy.util.filter_spans(doc.spans[name])
Upvotes: 1