eng2019

Reputation: 1035

Convert NER SpaCy format to IOB format

I have data which is already labelled in SpaCy format. For example:

("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
("I like London and Berlin.", {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]})

But I want to try training it with any other NER model, such as BERT-NER, which requires IOB tagging instead. Is there any conversion code from SpaCy data format to IOB?

Thanks!

Upvotes: 10

Views: 10007

Answers (5)

gamingflexer

Reputation: 339

import spacy

# your data in the question's format: (text, {"entities": [(start, end, label)]})
data = [
    ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
    ("I like London and Berlin.", {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]}),
]
nlp = spacy.blank("en")

for text, labels in data:
    doc = nlp(text)  # tokenize the raw text
    ents = []
    for start, end, label in labels["entities"]:
        ents.append(doc.char_span(start, end, label=label))
    doc.ents = ents

    for tok in doc:
        label = tok.ent_iob_
        if tok.ent_iob_ != "O":
            label += "-" + tok.ent_type_
        print(tok, label, sep="\t")

If you get a NoneType error, add a try block or clean your dataset: doc.char_span returns None when the character offsets do not line up with token boundaries.
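
For example, a guard like this minimal sketch skips misaligned spans instead of crashing:

span = doc.char_span(start, end, label=label)
if span is not None:  # char_span returns None for misaligned offsets
    ents.append(span)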

Upvotes: 1

mohamed mzoughi

Reputation: 11

I have faced this kind of problem. What I did was transform the data to the spaCy binary format, then load it back from the DocBin object using this code:

import spacy
from spacy.tokens import DocBin

db = DocBin().from_disk("your_docbin_name.spacy")
nlp = spacy.blank("language_used")
Documents = list(db.get_docs(nlp.vocab))

Then this code may help you to extract the IOB format from it:

for elem in Documents[0]:
    if elem.ent_iob_ != "O":
        print(elem.text, elem.ent_iob_, "-", elem.ent_type_)
    else:
        print(elem.text, elem.ent_iob_)

Here is an example of my output (on Arabic text):

عبرت O
الديناميكية B - POLITIQUE
النسوية I - POLITIQUE
التي O
تأسست O
بعد O
25 O
جويلية O
2021 O
عن O
رفضها O
القطعي O
لمشروع O
تنقيح B - POLITIQUE
المرسوم B - POLITIQUE
عدد O
88 O
لسنة O
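
The snippet above only prints the first document. If you need the IOB lines for every document, a loop like this minimal sketch (the output file name is hypothetical) writes one token per line with a blank line between documents:

with open("iob_output.txt", "w", encoding="utf-8") as f:
    for doc in Documents:
        for tok in doc:
            tag = tok.ent_iob_ if tok.ent_iob_ == "O" else tok.ent_iob_ + "-" + tok.ent_type_
            f.write(tok.text + "\t" + tag + "\n")
        f.write("\n")  # document separator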

Upvotes: 1

Shamil Kayanolil

Reputation: 11

First you need to convert your annotated JSON file to CSV. Then you can run the code below to convert it into the spaCy v2 training format.
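
The code assumes a CSV with two columns: ner holding the text and label holding a single entity type that covers the whole text, so each row becomes ("text", {"entities": [(0, len(text), label)]}). A hypothetical example:

ner,label
Shaka Khan,PERSON
Berlin,LOC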

import pandas as pd

df = pd.read_csv('SC_CSV.csv')
l1 = []
l2 = []

for i in range(0, len(df['ner'])):
    l1.append(df['ner'][i])
    l2.append({"entities": [(0, len(df['ner'][i]), df['label'][i])]})

TRAIN_DATA = list(zip(l1, l2))

Now TRAIN_DATA is in the spaCy v2 format.

This converts the data from the old spaCy v2 format to the new spaCy v3 binary format:

from tqdm import tqdm
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en") # load a new spacy model
db = DocBin() # create a DocBin object

for text, annot in tqdm(TRAIN_DATA): # data in previous format
    doc = nlp.make_doc(text) # create doc object from text
    ents = []
    for start, end, label in annot["entities"]: # add character indexes
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is None:
            print("Skipping entity")
        else:
            ents.append(span)
    doc.ents = ents # label the text with the ents
    db.add(doc)

db.to_disk("./train.spacy") # save the docbin object

Upvotes: 1

aab

Reputation: 11474

This is closely related to and mostly copied from https://stackoverflow.com/a/59209377/461847, see the notes in the comments there, too:

import spacy
from spacy.gold import biluo_tags_from_offsets  # spaCy v2; in v3 use: from spacy.training import offsets_to_biluo_tags

TRAIN_DATA = [
    ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
    ("I like London and Berlin.", {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]}),
]

nlp = spacy.load('en_core_web_sm')
docs = []
for text, annot in TRAIN_DATA:
    doc = nlp(text)
    tags = biluo_tags_from_offsets(doc, annot['entities'])
    # then convert L->I and U->B to have IOB tags for the tokens in the doc
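
A minimal sketch of that last step (the helper name biluo_to_iob is my own):

def biluo_to_iob(biluo_tags):
    # L-X (last token of an entity) becomes I-X, U-X (single-token entity) becomes B-X
    iob = []
    for tag in biluo_tags:
        if tag.startswith("L-"):
            tag = "I-" + tag[2:]
        elif tag.startswith("U-"):
            tag = "B-" + tag[2:]
        iob.append(tag)
    return iob

For the first training example, the BILUO tags ['O', 'O', 'B-PERSON', 'L-PERSON', 'O'] become ['O', 'O', 'B-PERSON', 'I-PERSON', 'O'].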

Upvotes: 8

Jindřich

Reputation: 11230

I am afraid you will have to write your own conversion, because the IOB encoding depends on what tokenization the pre-trained representation model (BERT, RoBERTa, or whatever pre-trained model of your choice) uses.

The SpaCy format specifies the character span of the entity, i.e.

"Who is Shaka Khan?"[7:17]

will return "Shaka Khan". You need to match that to tokens used by the pre-trained model.

Here are examples of how different models tokenize the example sentence when you use Hugging Face's Transformers (see the snippet after the list for how to reproduce them).

  • BERT: ['Who', 'is', 'S', '##hak', '##a', 'Khan', '?']
  • RoBERTa: ['Who', '_is', '_Sh', 'aka', '_Khan', '?']
  • XLNet: ['▁Who', '▁is', '▁Shak', 'a', '▁Khan', '?']
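
A minimal sketch for reproducing such a tokenization with Hugging Face's AutoTokenizer (the checkpoint name here is an assumption; the exact subwords vary by checkpoint):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
print(tokenizer.tokenize("Who is Shaka Khan?"))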

Once you know how the tokenizer works, you can implement the conversion. Something like this can work for BERT tokenization:

entities = [(7, 17, "PERSON")]
tokenized = ['Who', 'is', 'S', '##hak', '##a', 'Khan', '?']

cur_start = 0
state = "O"  # Outside
tags = []
for token in tokenized:
    # Deal with BERT's way of encoding spaces
    if token.startswith("##"):
        token = token[2:]
    else:
        token = " " + token

    cur_end = cur_start + len(token)
    if entities and state == "O" and cur_start <= entities[0][0] < cur_end:
        tags.append("B-" + entities[0][2])
        state = "I-" + entities[0][2]
    elif state.startswith("I-") and cur_start < entities[0][1] < cur_end:
        tags.append(state)
        state = "O"
        entities.pop(0)
    else:
        tags.append(state)
    cur_start = cur_end

# tags == ['O', 'O', 'B-PERSON', 'I-PERSON', 'I-PERSON', 'I-PERSON', 'O']

Note that the snippet would break if a single BERT token contained the end of one entity and the start of the next one. The tokenizer also does not record how many spaces (or other whitespace characters) were in the original string, which is another potential source of errors.

Upvotes: 6
