JFerro

Reputation: 3433

spaCy span labels: how to add spans with a particular label to a Doc

Following the spaCy documentation, I found this example:

https://spacy.io/usage/visualizers#span

import spacy
from spacy import displacy
from spacy.tokens import Span

text = "Welcome to the Bank of China."

nlp = spacy.blank("en")
doc = nlp(text)

doc.spans["sc"] = [
    Span(doc, 3, 6, "ORG"), 
    Span(doc, 5, 6, "GPE"),
]

#displacy.serve(doc, style="span")

I cannot understand why the list of spans is added under the key "sc". Every span already has a label (ORG, GPE, etc.), so why do you need yet another qualifier?

Also, after adding the spans to the doc, I cannot understand why they no longer seem to be Span objects:

for span in doc.spans:
    print(type(span))

That gives "str". But with:

for span in doc.spans['sc']:
    print(type(span))

I found the spans. If every span already has a label and is also stored in a list under a name such as "sc" (or whatever), what is this double labeling of spans for?

Upvotes: 0

Views: 1114

Answers (1)

polm23

Reputation: 15593

Doc.spans is like a dictionary, where each key is a string and each value is a SpanGroup, which is basically a list of spans.
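As a minimal sketch of that behavior, reusing the example from the question: iterating over Doc.spans directly yields the keys (strings), which is why type(span) printed "str", while indexing by key returns the SpanGroup that holds the actual Span objects. The variable names keys, group, and labels are only illustrative.

```python
import spacy
from spacy.tokens import Span, SpanGroup

nlp = spacy.blank("en")
doc = nlp("Welcome to the Bank of China.")

# Assigning a plain list under a key stores it as a SpanGroup.
doc.spans["sc"] = [Span(doc, 3, 6, "ORG"), Span(doc, 5, 6, "GPE")]

# Iterating over doc.spans yields its keys (strings), like a dict.
keys = list(doc.spans)

# Indexing by key returns the SpanGroup; its items are Span objects.
group = doc.spans["sc"]
labels = [(span.text, span.label_) for span in group]

print(keys)
print(type(group).__name__)
print(labels)
```

So the key names the group, while the label describes each individual span inside it.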

The reason Doc.spans is a dictionary, instead of just a single list of spans, is so that you can have different components add lists of spans for different reasons, or have a single component add different groups of spans.

For example, if you have a coreference component, it could use one SpanGroup for each "cluster", where a cluster is a list of spans that refer to the same thing. For the sentence "John Smith called from New York, he said it's raining there", ["John Smith", "he"] would be one cluster, and ["New York", "there"] would be another.

If you had a spancat component and also a coref component, they would both need to set spans on the Doc, but you wouldn't want those spans to get mixed up; Doc.spans allows you to keep things clean and separate.
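A rough sketch of how that separation could look, setting the groups by hand rather than with real components. The key names "coref_cluster_1" and "coref_cluster_2", the labels PERSON and GPE, and the helper span_for are assumptions made up for this illustration; they are not necessarily what spaCy's own coref or spancat components produce.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("John Smith called from New York, he said it's raining there")

def span_for(doc, phrase, label=""):
    # Hypothetical helper: build a Span from the first occurrence of a phrase.
    start = doc.text.index(phrase)
    return doc.char_span(start, start + len(phrase), label=label)

# Labeled spans, as a span categorizer might store them under one key.
doc.spans["sc"] = [
    span_for(doc, "John Smith", "PERSON"),
    span_for(doc, "New York", "GPE"),
]

# Coreference clusters, one SpanGroup per cluster, under separate keys.
doc.spans["coref_cluster_1"] = [span_for(doc, "John Smith"), span_for(doc, "he")]
doc.spans["coref_cluster_2"] = [span_for(doc, "New York"), span_for(doc, "there")]

# Each group can be read back independently of the others.
for key, group in doc.spans.items():
    print(key, [span.text for span in group])
```

Because each group lives under its own key, code that only cares about the labeled spans can read doc.spans["sc"] without having to filter out the coreference clusters.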

Upvotes: 1
