How to set entity information for token which is included in more than one span in entities in SpaCy?

Question

I'm a spaCy beginner working through samples for learning purposes. I followed an article on how to create an address parser using spaCy. My tutorial dataset is as follows

which runs perfectly.

Then I created my own dataset, which contains addresses in Denmark,

but when I run the training command, I get this error:

ValueError: [E1010] Unable to set entity information for token 1 which is included in more than one span in entities, blocked, missing, or outside.

According to questions asked on Stack Overflow and other platforms, the cause of the error is a word that appears in more than one span:

[18, Mbl Denmark A/S, Glarmestervej, 8600, Silkeborg, Denmark]

The recipient contains the word "Denmark" and the country is also "Denmark".

Can anyone suggest a solution to fix this?

Code that creates the DocBin object for building the training/test data:

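from spacy.tokens import DocBin

# nlp and training_data are assumed to be defined earlier
db = DocBin()
for text, annotations in training_data:
    doc = nlp(text)  # Construct a Doc object
    ents = []
    for start, end, label in annotations:
        # Convert character offsets into a token-aligned Span
        span = doc.char_span(start, end, label=label)
        ents.append(span)
    doc.ents = ents  # raises E1010 when two spans share a token
    db.add(doc)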

polm23 · Accepted Answer

In general, entities can't be nested or overlapping, and if you have data like that you have to decide what kind of output you want.

If you actually want nested or overlapping annotations, you can use the spancat, which supports that.
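As a rough sketch of what overlapping annotations look like, spans can be stored in doc.spans instead of doc.ents. The address string and the RECIPIENT and COUNTRY labels below are assumptions made up for illustration, as is the blank English pipeline; "sc" is the key spancat reads by default.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only) is enough here.
nlp = spacy.blank("en")

# Illustrative address, not taken from the asker's dataset.
text = "Mbl Denmark A/S, Glarmestervej 18, 8600 Silkeborg, Denmark"
doc = nlp(text)

# Two spans that share the token "Denmark".
recipient = doc.char_span(0, len("Mbl Denmark A/S"), label="RECIPIENT")
start = text.index("Denmark")  # first match, inside the recipient
country = doc.char_span(start, start + len("Denmark"), label="COUNTRY")

# Assigning both to doc.ents would raise E1010; a span group accepts overlaps.
doc.spans["sc"] = [recipient, country]
overlapping = [(span.text, span.label_) for span in doc.spans["sc"]]
print(overlapping)
```

Training on data like this requires a spancat component in the pipeline config rather than ner.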

In this case though, "Denmark" in "Mbl Denmark" is not really interesting and you probably don't want to annotate it. I would recommend you use filter_spans on your list of spans before assigning it to the Doc. filter_spans will take the longest (or first) span of any overlapping spans, resulting in a list of non-overlapping spans, which you can use for normal entity annotations.
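A minimal sketch of that suggestion, assuming a blank English pipeline and a made-up address with hypothetical RECIPIENT and COUNTRY labels (the asker's real data and labels are not shown). It also skips spans for which char_span returns None, which happens when offsets do not line up with token boundaries.

```python
import spacy
from spacy.util import filter_spans

# Assumption: a blank English pipeline (tokenizer only) is enough here.
nlp = spacy.blank("en")

# Illustrative address, not taken from the asker's dataset.
text = "Mbl Denmark A/S, Glarmestervej 18, 8600 Silkeborg, Denmark"


def offsets(substring):
    """Return (start, end) character offsets of the first match in text."""
    start = text.index(substring)
    return start, start + len(substring)


annotations = [
    (*offsets("Mbl Denmark A/S"), "RECIPIENT"),
    (*offsets("Denmark"), "COUNTRY"),  # first match lies inside RECIPIENT
]

doc = nlp(text)
spans = []
for start, end, label in annotations:
    span = doc.char_span(start, end, label=label)
    if span is not None:  # None means the offsets are not token-aligned
        spans.append(span)

# Keep the longest span among overlapping ones, then assign as entities.
filtered = filter_spans(spans)
doc.ents = filtered
result = [(ent.text, ent.label_) for ent in doc.ents]
print(result)
```

Here the shorter COUNTRY span is dropped because it sits inside the longer RECIPIENT span. If the country label was meant for the final "Denmark" in the address, the offsets in the data need correcting, since filter_spans only removes the conflict.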
