imhans33

Reputation: 133

Converting spaCy NER entity format to CoNLL 2003 format

I am working on an NER application where I have data annotated in the following format:

[('The F15 aircraft uses a lot of fuel', {'entities': [(4, 7, 'aircraft')]}),
 ('did you see the F16 landing?', {'entities': [(16, 19, 'aircraft')]}),
 ('how many missiles can a F35 carry', {'entities': [(24, 27, 'aircraft')]}),
 ('is the F15 outdated', {'entities': [(7, 10, 'aircraft')]}),
 ('how long does it take to train a F16 pilot',{'entities': [(33, 36, 'aircraft')]}),
 ('how much does a F35 cost', {'entities': [(16, 19, 'aircraft')]})]

Is there a way to convert this to CoNLL 2003 format?

Upvotes: 1

Views: 1506

Answers (1)

polm23

Reputation: 15633

Which CoNLL format do you mean?

You can get a simple CoNLL format by doing something like this:

import spacy

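# One example from the question; replace with your own annotated data
data = [
    ("The F15 aircraft uses a lot of fuel", {"entities": [(4, 7, "aircraft")]}),
]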

nlp = spacy.blank("en")

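for text, labels in data:
    doc = nlp(text)
    ents = []
    for start, end, label in labels["entities"]:
        span = doc.char_span(start, end, label=label)
        # char_span returns None when the offsets don't line up with token boundaries
        if span is not None:
            ents.append(span)
    doc.ents = ents
    for tok in doc:
        # ent_iob_ is "B", "I", or "O"; append the entity type for B and I
        label = tok.ent_iob_
        if tok.ent_iob_ != "O":
            label += "-" + tok.ent_type_
        print(tok, label, sep="\t")
    print()  # blank line between sentences, as CoNLL-style files expect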
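
If you specifically need the four-column CoNLL 2003 layout (word, POS tag, chunk tag, NER tag), here is a rough sketch using only the standard library. The function name `to_conll`, the regex tokenizer, and the `-X-` placeholders for the POS and chunk columns are all assumptions for illustration; the annotated data has no POS or chunk information, and the tokenization will not match spaCy's exactly. It emits B-/I-/O tags in the same style as the loop above.

```python
import re

def to_conll(data, pos_placeholder="-X-", chunk_placeholder="-X-"):
    """Turn (text, {"entities": [(start, end, label), ...]}) pairs into
    CoNLL-2003-style lines: word, POS, chunk, NER tag, space-separated.

    The POS and chunk columns are filled with placeholders (an assumption),
    since the source annotations only contain entity spans.
    """
    lines = []
    for text, annotations in data:
        entities = annotations["entities"]
        started = set()  # indices of entities that already received a B- tag
        # Simple tokenizer (assumption): runs of word characters, or single punctuation marks
        for match in re.finditer(r"\w+|[^\w\s]", text):
            tag = "O"
            for i, (start, end, label) in enumerate(entities):
                # Tag the token only if it lies fully inside the entity span
                if start <= match.start() and match.end() <= end:
                    tag = ("I-" if i in started else "B-") + label
                    started.add(i)
                    break
            lines.append(" ".join([match.group(), pos_placeholder, chunk_placeholder, tag]))
        lines.append("")  # blank line separates sentences
    return lines

sample = [("is the F15 outdated", {"entities": [(7, 10, "aircraft")]})]
print("\n".join(to_conll(sample)))
```

Tokens that only partially overlap an entity span are left as O here, which is roughly the counterpart of char_span returning None in the spaCy version.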

There is also a library, spacy_conll, that will do this for you.

Upvotes: 1
