Reputation: 3394
I'm a spaCy beginner working through samples for learning purposes. I followed an article on how to create an address parser using spaCy.
My tutorial dataset is as follows,
which runs perfectly.
Then I created my own dataset, which contains addresses in Denmark,
but when I run the training command, I get this error:
ValueError: [E1010] Unable to set entity information for token 1 which is included in more than one span in entities, blocked, missing, or outside.
According to questions asked on Stack Overflow and other platforms, the cause of the error is a word that appears in more than one span:
[18, Mbl Denmark A/S, Glarmestervej, 8600, Silkeborg, Denmark]
The Recipient contains the word "Denmark" and the Country also contains the word "Denmark".
Can anyone suggest a solution to fix this?
Code to create the DocBin object for building the training/test data:
db = DocBin()
for text, annotations in training_data:
    doc = nlp(text)  # Construct a Doc object
    ents = []
    for start, end, label in annotations:
        span = doc.char_span(start, end, label=label)
        ents.append(span)
    doc.ents = ents
    db.add(doc)
Upvotes: 1
Views: 584
Reputation: 15623
In general, entities can't be nested or overlapping, and if you have data like that you have to decide what kind of output you want.
If you actually want nested or overlapping annotations, you can use the spancat component, which supports that.
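As a rough illustration of the difference, overlapping spans can be stored in a span group on doc.spans rather than in doc.ents. This is only a sketch: the text, the labels, and the group key "sc" (which I believe is spancat's default spans_key) are assumptions, not taken from the question.

```python
import spacy

# Blank English pipeline: tokenizer only, no trained model needed.
nlp = spacy.blank("en")
doc = nlp("Mbl Denmark A/S, Glarmestervej 18, 8600 Silkeborg, Denmark")

# Two overlapping spans: "Denmark" sits inside the recipient span.
# The labels are hypothetical.
recipient = doc.char_span(0, 15, label="RECIPIENT")
inner_country = doc.char_span(4, 11, label="COUNTRY")

# A span group accepts overlapping spans, unlike doc.ents.
doc.spans["sc"] = [recipient, inner_country]
```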
In this case though, "Denmark" in "Mbl Denmark" is not really interesting and you probably don't want to annotate it. I would recommend you use filter_spans on your list of spans before assigning it to the Doc. filter_spans will take the longest (or first) span of any overlapping spans, resulting in a list of non-overlapping spans, which you can use for normal entity annotations.
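Here is a minimal sketch of how that could look in the DocBin loop from the question. The address text, the labels, and the way the offsets are computed are assumptions for illustration. In particular, looking up "Denmark" with str.index returns the first occurrence, which is one plausible way the overlapping annotation could have come about. The helper name make_doc is hypothetical.

```python
import spacy
from spacy.tokens import DocBin
from spacy.util import filter_spans

nlp = spacy.blank("en")  # tokenizer only; no trained model required

# Hypothetical training example with an overlapping annotation.
text = "Mbl Denmark A/S, Glarmestervej 18, 8600 Silkeborg, Denmark"
recipient_start = text.index("Mbl Denmark A/S")
country_start = text.index("Denmark")  # first match, inside the recipient
training_data = [
    (text, [
        (recipient_start, recipient_start + len("Mbl Denmark A/S"), "RECIPIENT"),
        (country_start, country_start + len("Denmark"), "COUNTRY"),
    ]),
]


def make_doc(nlp, text, annotations):
    """Build a Doc whose entities are guaranteed not to overlap."""
    doc = nlp(text)
    spans = []
    for start, end, label in annotations:
        span = doc.char_span(start, end, label=label)
        if span is not None:  # char_span returns None for misaligned offsets
            spans.append(span)
    # Keep the longest (or first) of any overlapping spans.
    doc.ents = filter_spans(spans)
    return doc


db = DocBin()
for text, annotations in training_data:
    db.add(make_doc(nlp, text, annotations))
```

Note that with these offsets the COUNTRY span is dropped, so the final "Denmark" stays unlabeled. If you want it annotated, the country offsets have to point at the last occurrence (for example via text.rindex) rather than the first.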
Upvotes: 1