JFerro

Reputation: 3433

Filter overlapping custom spans in a spaCy doc

I have a set of regexes like the ones below. (For simplicity the patterns shown here are trivial; in the real case they are very long and barely comprehensible, since they are generated automatically by another tool.) I want to create spans in a doc based on those regexes. This is the code:

import re

import spacy
from spacy.tokens import Span

text = " this is blue but there is a blue print. The light is red and the heat is in the infra red."

my_regexes = {'blue': ["blue", "blue print"],
              'red': ["red", "infra red"]}
nlp = spacy.blank("en")
doc = nlp(text)
print(doc.text)
for name, rxs in my_regexes.items():
    doc.spans[name] = []
    for rx in rxs:
        for i, match in enumerate(re.finditer(rx, doc.text)):
            start, end = match.span()
            # char_span returns a Span, or None if the match doesn't map to a valid token sequence
            span = doc.char_span(start, end, alignment_mode="expand")
            if span is None:
                continue
            span_to_add = Span(doc, span.start, span.end, label=name + str(i))
            doc.spans[name].append(span_to_add)
            print("Found match:", name, start, end, span.text)

It works. Now I want to filter the spans so that when a sequence of tokens (for instance "infra red") contains another span ("red"), only the longest one is kept. I saw this: How to avoid double-extracting of overlapping patterns in SpaCy with Matcher?

but that appears to be for a Matcher, and I cannot make it work in my case, since I would like to remove the contained Span from the document.

Any idea?

Upvotes: 0

Views: 464

Answers (1)

aab

Reputation: 11474

spacy.util.filter_spans will do this. The answer is the same as the linked question, where matcher results are converted to spans in order to filter them with this function.

doc.spans[name] = spacy.util.filter_spans(doc.spans[name])
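
For completeness, here is a minimal end-to-end sketch of how this could look with the regexes from the question. It assumes spaCy v3 (for doc.spans) and makes a few choices of my own that are not in the original code: the leading space in the text is dropped, the label is passed straight to doc.char_span instead of building a second Span, and the matches are collected in a plain list before filtering. filter_spans prefers the longest span when spans overlap (and the earliest one among equally long spans), so the "red" inside "infra red" should be discarded while the standalone "red" stays.

```python
import re

import spacy
from spacy.util import filter_spans

text = "this is blue but there is a blue print. The light is red and the heat is in the infra red."
my_regexes = {"blue": ["blue", "blue print"], "red": ["red", "infra red"]}

nlp = spacy.blank("en")
doc = nlp(text)

for name, rxs in my_regexes.items():
    spans = []  # all matches for this group, overlaps included
    for rx in rxs:
        for match in re.finditer(rx, doc.text):
            start, end = match.span()
            span = doc.char_span(start, end, label=name, alignment_mode="expand")
            if span is not None:  # skip matches that don't map to tokens
                spans.append(span)
    # Keep only non-overlapping spans; on overlap the longest span wins
    doc.spans[name] = filter_spans(spans)

for name in my_regexes:
    print(name, [span.text for span in doc.spans[name]])
```

Note that this filters each group on its own. If spans from different groups can overlap and you want only the longest one overall, one option would be to pass the combined list of spans to filter_spans and then redistribute the result by label.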

Upvotes: 1
