Reputation: 133
I am working on an NER application where I have data annotated in the following format.
[('The F15 aircraft uses a lot of fuel', {'entities': [(4, 7, 'aircraft')]}),
('did you see the F16 landing?', {'entities': [(16, 19, 'aircraft')]}),
('how many missiles can a F35 carry', {'entities': [(24, 27, 'aircraft')]}),
('is the F15 outdated', {'entities': [(7, 10, 'aircraft')]}),
('how long does it take to train a F16 pilot', {'entities': [(33, 36, 'aircraft')]}),
('how much does a F35 cost', {'entities': [(16, 19, 'aircraft')]})]
Is there a way to convert this to CoNLL 2003 format?
Upvotes: 1
Views: 1506
Reputation: 15633
Which CoNLL format do you mean?
You can get a simple CoNLL format by doing something like this:
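import spacy

data = ...  # your data, in the format shown in the question

nlp = spacy.blank("en")  # blank pipeline: tokenizer only, no trained model needed
for text, labels in data:
    doc = nlp(text)
    ents = []
    for start, end, label in labels["entities"]:
        span = doc.char_span(start, end, label=label)
        # char_span returns None when the offsets do not align with token boundaries
        if span is not None:
            ents.append(span)
    doc.ents = ents
    for tok in doc:
        label = tok.ent_iob_
        if tok.ent_iob_ != "O":
            label += "-" + tok.ent_type_
        print(tok, label, sep="\t")
    print()  # blank line between sentences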
There is also a library, spacy_conll, that will do this for you.
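If you would rather not depend on spaCy at all, here is a rough sketch of the same token/tag conversion using only the standard library. It is not the spaCy tokenizer: the regex split (words and single punctuation marks) and the function name to_conll are my own assumptions for illustration. It also assumes your entity offsets fall on token boundaries; a token that only partly overlaps an entity is tagged "O".

```python
import re


def to_conll(data):
    """Return 'token<TAB>tag' lines with a blank line after each sentence."""
    lines = []
    for text, annotations in data:
        entities = sorted(annotations["entities"])
        # Hypothetical tokenizer: runs of word characters, or single punctuation marks.
        for match in re.finditer(r"\w+|[^\w\s]", text):
            tag = "O"
            for start, end, label in entities:
                # Tag the token only if it lies fully inside the entity span.
                if start <= match.start() and match.end() <= end:
                    prefix = "B" if match.start() == start else "I"
                    tag = f"{prefix}-{label}"
                    break
            lines.append(f"{match.group()}\t{tag}")
        lines.append("")  # blank line separates sentences
    return "\n".join(lines)


sample = [("did you see the F16 landing?", {"entities": [(16, 19, "aircraft")]})]
print(to_conll(sample))
```

This only produces the two-column token/tag layout. The full CoNLL 2003 files also carry POS and chunk columns, which you would need a tagger for.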
Upvotes: 1