Reputation: 493
In my spaCy project, I would like to initialize a Doc object with text, labels and whitespaces. spaCy doesn't appreciate the way I provide the labels however, and shows its lack of appreciation in the following error message:
doc = Doc(nlp.vocab, words=token_texts, ents=labels, spaces=whitespaces)
File "spacy\tokens\doc.pyx", line 297, in spacy.tokens.doc.Doc.__init__
ValueError: [E177] Ill-formed IOB input detected: ('', 'O')
The code:
import spacy
from spacy.tokens import Doc
nlp = spacy.load("en_core_web_sm")
token_texts = ["I", "like", "potatoes", "!"]
labels = [("", "O"), ("", "O"), ("food", "I"), ("", "O")]
whitespaces = [True, True, False, False]
doc = Doc(nlp.vocab, words=token_texts, ents=labels, spaces=whitespaces)
Does anyone know exactly how to serve spaCy the entities on a silver platter?
The spaCy Doc documentation states
ents: A list of strings, of the same length of words, to assign the token-based IOB tag. Defaults to None. Optional[List[str]]
The type hint List[str] made me attempt ["", "", "food", ""], which however results in the same error message.
Stack Overflow questions that do not have the answer:
Convert NER SpaCy format to IOB format
Convert list of IOB formatted data to simple IOB formatted data
Failed to convert iob to spaCy binary format
Replace to entity tags to IOB format
Upvotes: 1
Views: 693
Reputation: 15593
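IOB tags should be strings in the same format used in CoNLL files, such as "B-PERSON": the IOB prefix, a hyphen, then the entity label, with a bare "O" for tokens outside any entity. So in your example code: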
labels = ["O", "O", "I-FOOD", "O"]
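If the labels already exist as (label, IOB) tuples like the ones in the question, a small helper can turn them into that string format. This is only a sketch: to_iob_tags is a hypothetical name, not part of spaCy, and uppercasing the label is an assumption made to match the "I-FOOD" spelling above.

```python
from typing import List, Tuple


def to_iob_tags(pairs: List[Tuple[str, str]]) -> List[str]:
    """Convert (label, iob) tuples into CoNLL-style tag strings.

    Hypothetical helper for illustration; not part of spaCy.
    """
    tags = []
    for label, iob in pairs:
        if iob == "O" or not label:
            # Tokens outside any entity get a bare "O".
            tags.append("O")
        elif iob in ("B", "I"):
            # Prefix, hyphen, label. Uppercasing is an assumption.
            tags.append(f"{iob}-{label.upper()}")
        else:
            raise ValueError(f"Unexpected IOB prefix: {iob!r}")
    return tags


labels = [("", "O"), ("", "O"), ("food", "I"), ("", "O")]
print(to_iob_tags(labels))  # ['O', 'O', 'I-FOOD', 'O']
```

The resulting list can then be passed as ents= to the Doc constructor from the question. In strict IOB2 data the first token of an entity is usually tagged "B-", so "B-FOOD" would be the more conventional tag for a single-token entity.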
Upvotes: 1