Reputation: 1035
I have data which is already labelled in SpaCy format. For example:
("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
("I like London and Berlin.", {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]})
But I want to try training it with another NER model, such as BERT-NER, which requires IOB tagging instead. Is there any conversion code from the SpaCy data format to IOB?
Thanks!
Upvotes: 10
Views: 10007
Reputation: 339
import spacy

# your data in the spaCy training format, e.g. from the question:
data = [
    ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
    ("I like London and Berlin.", {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]}),
]

nlp = spacy.blank("en")
for text, labels in data:
    doc = nlp(text)
    ents = []
    for start, end, label in labels["entities"]:
        ents.append(doc.char_span(start, end, label=label))
    doc.ents = ents
    for tok in doc:
        tag = tok.ent_iob_
        if tok.ent_iob_ != "O":
            tag += "-" + tok.ent_type_
        print(tok, tag, sep="\t")
If you get a NoneType error here, it is because char_span returns None when the character offsets do not line up with token boundaries; depending on your dataset, either clean the data or guard against the None spans.
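A minimal sketch of such a guard for the inner loop (alignment_mode="contract" is only available in spaCy v3; it snaps a misaligned span onto the tokens fully inside it instead of dropping it):

ents = []
for start, end, label in labels["entities"]:
    # char_span returns None when (start, end) does not align with token boundaries
    span = doc.char_span(start, end, label=label, alignment_mode="contract")
    if span is None:
        print("Skipping misaligned entity", (start, end, label), "in:", text)
    else:
        ents.append(span)
doc.ents = ents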
Upvotes: 1
Reputation: 11
I have faced this kind of problem. What I did was transform the data to the spaCy binary format, then load it back from the DocBin object with this code:
import spacy
from spacy.tokens import DocBin

db = DocBin().from_disk("your_docbin_name.spacy")
nlp = spacy.blank("language_used")
documents = list(db.get_docs(nlp.vocab))
Then this code may help you extract the IOB format from it:
for elem in documents[0]:
    if elem.ent_iob_ != "O":
        print(elem.text, elem.ent_iob_, "-", elem.ent_type_)
    else:
        print(elem.text, elem.ent_iob_)
Here is an example of my output:
عبرت O
الديناميكية B - POLITIQUE
النسوية I - POLITIQUE
التي O
تأسست O
بعد O
25 O
جويلية O
2021 O
عن O
رفضها O
القطعي O
لمشروع O
تنقيح B - POLITIQUE
المرسوم B - POLITIQUE
عدد O
88 O
لسنة O
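If you need the tags as a file for a BERT-NER style pipeline, the usual layout is one token and tag per line with a blank line between sentences; a minimal sketch building on the loop above (the file name is just an example):

with open("train_iob.txt", "w", encoding="utf-8") as f:
    for doc in documents:
        for tok in doc:
            # build tags like "B-POLITIQUE" / "I-POLITIQUE" / "O"
            tag = tok.ent_iob_
            if tag != "O":
                tag += "-" + tok.ent_type_
            f.write(f"{tok.text} {tag}\n")
        f.write("\n")  # blank line between sentences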
Upvotes: 1
Reputation: 11
First you need to convert your annotated JSON file to CSV. Then you can run the code below to convert it into the spaCy v2 training format:
import pandas as pd

# assumes the CSV has a 'ner' column (the text) and a 'label' column
df = pd.read_csv('SC_CSV.csv')

l1 = []
l2 = []
for i in range(0, len(df['ner'])):
    l1.append(df['ner'][i])
    l2.append({"entities": [(0, len(df['ner'][i]), df['label'][i])]})

TRAIN_DATA = list(zip(l1, l2))
TRAIN_DATA
Now TRAIN_DATA is in the spaCy v2 format. The following code converts it from the old spaCy v2 format to the brand new spaCy v3 format:
import pandas as pd
from tqdm import tqdm
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")  # load a new spacy model
db = DocBin()  # create a DocBin object

for text, annot in tqdm(TRAIN_DATA):  # data in previous format
    doc = nlp.make_doc(text)  # create doc object from text
    ents = []
    for start, end, label in annot["entities"]:  # add character indexes
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is None:
            print("Skipping entity")
        else:
            ents.append(span)
    doc.ents = ents  # label the text with the ents
    db.add(doc)

db.to_disk("./train.spacy")  # save the docbin object
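Note that train.spacy is the binary training file; if what you ultimately need is IOB tags (as in the question), you can load it back with DocBin().from_disk() and read each token's ent_iob_ / ent_type_ attributes, as shown in the DocBin answer above.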
Upvotes: 1
Reputation: 11474
This is closely related to and mostly copied from https://stackoverflow.com/a/59209377/461847, see the notes in the comments there, too:
import spacy
from spacy.gold import biluo_tags_from_offsets  # spaCy v2; in v3 use: from spacy.training import offsets_to_biluo_tags

TRAIN_DATA = [
    ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
    ("I like London and Berlin.", {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]}),
]

nlp = spacy.load('en_core_web_sm')
docs = []
for text, annot in TRAIN_DATA:
    doc = nlp(text)
    tags = biluo_tags_from_offsets(doc, annot['entities'])
    # then convert L -> I and U -> B to have IOB tags for the tokens in the doc
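A minimal sketch of that last step, plain string manipulation on the BILUO tags (spaCy v3 also ships a biluo_to_iob helper in spacy.training, if your version has it):

def biluo_to_iob(biluo_tags):
    # BILUO marks single-token entities with U- and entity-final tokens with L-;
    # IOB only has B- (begin) and I- (inside), so map U -> B and L -> I
    iob_tags = []
    for tag in biluo_tags:
        if tag.startswith("U-"):
            iob_tags.append("B-" + tag[2:])
        elif tag.startswith("L-"):
            iob_tags.append("I-" + tag[2:])
        else:
            iob_tags.append(tag)  # B-, I-, O and "-" (misaligned) stay as they are
    return iob_tags

for tok, tag in zip(doc, biluo_to_iob(tags)):  # e.g. inside the loop above
    print(tok, tag, sep="\t")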
Upvotes: 8
Reputation: 11230
I am afraid you will have to write your own conversion, because the IOB encoding depends on the tokenization that the pre-trained representation model (BERT, RoBERTa, or whatever pre-trained model of your choice) uses.
The SpaCy format specifies the character span of the entity, i.e. "Who is Shaka Khan?"[7:17] returns "Shaka Khan". You need to match that to the tokens used by the pre-trained model.
Here are examples of how different models tokenize the example sentence when you use Huggingface's Transformers:
['Who', 'is', 'S', '##hak', '##a', 'Khan', '?']
['Who', '_is', '_Sh', 'aka', '_Khan', '?']
['▁Who', '▁is', '▁Shak', 'a', '▁Khan', '?']
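For reference, tokenizations like these come straight from the model's tokenizer; a sketch with Huggingface's Transformers (the checkpoint name bert-base-cased is only an example, and the exact word pieces depend on the checkpoint you pick):

from transformers import AutoTokenizer

# bert-base-cased is an example checkpoint; other models split the text differently
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
print(tokenizer.tokenize("Who is Shaka Khan?"))
# e.g. ['Who', 'is', 'S', '##hak', '##a', 'Khan', '?']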
Once you know how the tokenizer works, you can implement the conversion. Something like this can work for BERT tokenization:
entities = [(7, 17, "PERSON")]
tokenized = ['Who', 'is', 'S', '##hak', '##a', 'Khan', '?']

cur_start = 0
state = "O"  # Outside
tags = []
for i, token in enumerate(tokenized):
    # Deal with BERT's way of encoding spaces:
    # "##" marks a word-internal piece, anything else starts a new word
    if token.startswith("##"):
        token = token[2:]
    elif i > 0:
        token = " " + token
    cur_end = cur_start + len(token)
    if state == "O" and entities and cur_start <= entities[0][0] < cur_end:
        tags.append("B-" + entities[0][2])
        state = "I-" + entities[0][2]
    elif state.startswith("I-") and cur_start < entities[0][1] <= cur_end:
        tags.append(state)
        state = "O"
        entities.pop(0)
    else:
        tags.append(state)
    cur_start = cur_end
Note that the snippet would break if one BERT token contained the end of one entity and the start of the following one. The tokenizer also does not record how many spaces (or other whitespace characters) there were in the original string; this is a potential source of errors as well.
Upvotes: 6