ValueError: Unable to set entity for token 27 which is included in more than one span in entities

Question

I am trying to convert a dataset to the .spacy format by first converting each record to a Doc and then adding it to a DocBin. The whole dataset file is accessible via Google Docs.

I run the following function:

import spacy
from spacy.tokens import DocBin
from tqdm import tqdm

def converter(data, outputFile):
    nlp = spacy.blank("en")  # blank English pipeline (tokenizer only)
    doc_bin = DocBin()  # container for the serialized Doc objects

    for text, annot in tqdm(data):  # data: list of (text, {"entities": [...]})
        doc = nlp.make_doc(text)  # tokenize the text into a Doc
        ents = []

        for start, end, label in annot["entities"]:  # character offsets
            # supported alignment modes: strict, contract, expand
            span = doc.char_span(start, end, label=label, alignment_mode="strict")
            # char_span returns None when the offsets do not align with token
            # boundaries; skip those to avoid
            # TypeError: object of type 'NoneType' has no len()
            if span is not None:
                ents.append(span)
        doc.ents = ents  # assign the entity spans to the doc
        doc_bin.add(doc)

    doc_bin.to_disk(f"./{outputFile}.spacy")  # save the DocBin to disk
    return f"Processed {len(doc_bin)}"

Running the function on the dataset raises the following error: ValueError: [E1010] Unable to set entity information for token 27 which is included in more than one span in entities, blocked, missing or outside.

Looking through the dataset file for the text that triggers this error, I found the following record:

[('HereLongText..(abstract)',
  {'entities': [('0', '27', 'SpecificDisease'),
    ('80', '93', 'SpecificDisease'),
    ('260', '278', 'SpecificDisease'),
    ('615', '628', 'SpecificDisease'),
    ('673', '691', 'SpecificDisease'),
    ('754', '772', 'SpecificDisease')]})]

I do not know how to solve this issue.
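
The error message says that one token ended up in more than one entity span, which usually means the annotations for that text contain overlapping or duplicated offsets. A minimal sketch of one way to clean the annotations before calling doc.char_span follows. It assumes that dropping the shorter of two overlapping spans is acceptable for this dataset. The helper name drop_overlapping and the overlapping sample offsets are hypothetical, made up for illustration. spaCy also provides spacy.util.filter_spans, which applies a similar longest-span-first rule to Span objects.

```python
def drop_overlapping(entities):
    """Keep a non-overlapping subset of (start, end, label) character spans.

    Longer spans win; ties go to the span that starts first.
    (Hypothetical helper, not part of spaCy.)
    """
    # Offsets in the sample record are strings, so normalise them to int.
    spans = [(int(start), int(end), label) for start, end, label in entities]
    # Longest first, then earliest start.
    spans.sort(key=lambda s: (-(s[1] - s[0]), s[0]))

    kept = []
    for start, end, label in spans:
        # Half-open ranges are disjoint when one ends before the other starts.
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append((start, end, label))
    return sorted(kept)


# Made-up annotations containing one nested span and one exact duplicate.
sample = [
    ("0", "27", "SpecificDisease"),
    ("10", "27", "SpecificDisease"),
    ("80", "93", "SpecificDisease"),
    ("80", "93", "SpecificDisease"),
]
cleaned = drop_overlapping(sample)
print(cleaned)
```

In the converter, the inner loop could then iterate over drop_overlapping(annot["entities"]) instead of annot["entities"], so that doc.ents only receives spans that do not share tokens.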
