Reputation: 334
I am using spaCy's built-in model en_core_web_lg
and want to train it with my own custom entities. While doing that, I am facing two issues:
The newly trained model overwrites the old knowledge and stops recognizing the original entities. For example, before training it can recognize PERSON and ORG, but after training it no longer recognizes PERSON and ORG.
During the training process, it gives me the following warning:
UserWarning: [W030] Some entities could not be aligned in the text "('I work in Google.',)" with entities "[(9, 15, 'ORG')]". Use `spacy.training.offsets_to_biluo_tags(nlp.make_doc(text), entities)` to check the alignment. Misaligned entities ('-') will be ignored during training.
Here is my complete code:
import spacy
import random
from spacy.util import minibatch, compounding
from pathlib import Path
from spacy.training.example import Example

sentence = ""
body1 = "James work in Facebook and love to have tuna fishes in the breakfast."

nlp_lg = spacy.load("en_core_web_lg")
print(nlp_lg.pipe_names)

doc = nlp_lg(body1)
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

train = [
    ('I had tuna fish in breakfast', {'entities': [(6, 14, 'FOOD')]}),
    ('I love prawns the most', {'entities': [(6, 12, 'FOOD')]}),
    ('fish is the rich source of protein', {'entities': [(0, 4, 'FOOD')]}),
    ('I work in Google.', {'entities': [(9, 15, 'ORG')]})
]

ner = nlp_lg.get_pipe("ner")
for _, annotations in train:
    for ent in annotations.get("entities"):
        ner.add_label(ent[2])

disable_pipes = [pipe for pipe in nlp_lg.pipe_names if pipe != 'ner']
with nlp_lg.disable_pipes(*disable_pipes):
    optimizer = nlp_lg.resume_training()
    for iteration in range(30):
        random.shuffle(train)
        losses = {}
        batches = minibatch(train, size=compounding(1.0, 4.0, 1.001))
        for batch in batches:
            text, annotation = zip(*batch)
            doc1 = nlp_lg.make_doc(str(text))
            example = Example.from_dict(doc1, annotations)
            nlp_lg.update(
                [example],
                drop=0.5,
                losses=losses,
                sgd=optimizer
            )
        print("Losses", losses)

doc = nlp_lg(body1)
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
Expected output:
James 0 5 PERSON
Facebook 14 22 ORG
tuna fishes 40 51 FOOD
Currently it is recognizing no entities at all.
Please let me know where I am going wrong. Thanks!
Upvotes: 5
Views: 1529
Reputation: 51
The reason why you are losing the previous NER labels PERSON and ORG is the following line in your code:
doc1 = nlp_lg.make_doc(str(text))
This overwrites the entity labels of your existing model.
The spaCy documentation (https://spacy.io/usage/training#api) states:
The Example object contains annotated training data, also called the gold standard. It’s initialized with a Doc object that will hold the predictions, and another Doc object that holds the gold-standard annotations.
Further down in the documentation, you find this:
doc = Doc(nlp.vocab, words=["Facebook", "released", "React", "in", "2014"])
example = Example.from_dict(doc, {"entities": ["U-ORG", "O", "U-TECHNOLOGY", "O", "U-DATE"]})
So what they do here is combine the annotated vocab of an existing/pretrained model ("nlp.vocab", coming from a pretrained model such as "en_core_web_lg") with the new vocab coming from your custom NER training data. They do this by "packing" both into a new instance of the Doc class.
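To make the docs snippet above runnable on its own, here is a sketch with the imports it needs (spacy.blank("en") just provides an empty English vocab):

import spacy
from spacy.tokens import Doc
from spacy.training import Example

nlp = spacy.blank("en")
doc = Doc(nlp.vocab, words=["Facebook", "released", "React", "in", "2014"])
example = Example.from_dict(doc, {"entities": ["U-ORG", "O", "U-TECHNOLOGY", "O", "U-DATE"]})
print(example.reference.ents)  # the gold-standard entities held by the Example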
In order to “add” your additional NER label data to your existing model instead of “overwriting” the labels there, you must change your code line
from:
doc1 = nlp_lg.make_doc(str(text))
to:
doc1 = Doc(vocab=nlp_lg.vocab, words=[str(text)])
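Note that Doc must be imported from spacy.tokens for this line to work. As a small illustration of the difference, here is a sketch using the question's nlp_lg (keep in mind that Doc(words=...) expects a list of pre-split tokens, so passing the whole string as a single list element produces a one-token Doc):

import spacy
from spacy.tokens import Doc

nlp_lg = spacy.load("en_core_web_lg")

# make_doc runs the model's tokenizer on the raw string:
print([t.text for t in nlp_lg.make_doc("I work in Google.")])
# Doc(words=...) takes already-split tokens; a one-element list yields one token:
print([t.text for t in Doc(vocab=nlp_lg.vocab, words=["I work in Google."])])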
Here is a working example (where I have my train_ner() method in a class):
def train_ner(self, trained_model_name_suffix: str = '', n_epochs=50, batch_size=20, dropout=0.05):
    # requires: import random, spacy; from spacy.tokens import Doc; from spacy.training import Example
    optimizer = self.nlp.resume_training()
    # Exclude those pipeline components that shall not be trained
    frozen_pipe_components = [pipe for pipe in self.nlp.pipe_names if pipe not in ['ner']]
    # Add a (new) NER label if it is not yet in the pretrained model
    existing_ner_labels = self.nlp.get_pipe('ner').labels
    ner_labels_to_be_added = [label for label in self.ner_labels if label not in existing_ner_labels]
    if len(ner_labels_to_be_added) > 0:
        ner_component = self.nlp.get_pipe('ner')
        for label in ner_labels_to_be_added:
            ner_component.add_label(label)
    with self.nlp.disable_pipes(*frozen_pipe_components):
        for epoch in range(n_epochs):
            random.shuffle(self.ner_train_data)
            losses = {}
            for batch in spacy.util.minibatch(self.ner_train_data, size=batch_size):
                for text, entities in batch:
                    doc = Doc(vocab=self.nlp.vocab, words=[text])
                    example = Example.from_dict(predicted=doc, example_dict=entities)
                    self.nlp.update([example], drop=dropout, sgd=optimizer, losses=losses)
            print(losses)
    print("Final loss: ", losses)
    self.save_trained_model(model_name_suffix=trained_model_name_suffix)
This works for me.
IMPORTANT REMARKS:
Make sure that the character offsets (start, end) in your NER training data align with the actual entity text. Note that spaCy's offsets are end-exclusive, like Python slices. If you have a NER training item like this:
[ "Google is a great company", {"entities": [[1, 6, "ORG"]]} ]
then the span covers "oogle" instead of "Google", because the start index (1) is off by one (as @polm23 already explained). This will trigger the misalignment warning in your training loop. The correct NER training item in this case is:
[ "Google is a great company", {"entities": [[0, 6, "ORG"]]} ]
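A quick way to sanity-check a span before training is plain Python slicing, since spaCy's character offsets follow the same end-exclusive convention:

# verify that the annotated offsets cover exactly the intended entity text
text = "Google is a great company"
start, end, label = 0, 6, "ORG"
print(repr(text[start:end]))  # should print 'Google'
assert text[start:end] == "Google"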
Make sure that you DO NOT label selectively, but label ALL entities in your NER training data. If you have this in your NER training data:
[ "Mark Zuckerberg is the CEO of Facebook", {"entities": [[30, 38, "ORG"]]} ]
then your model will learn that "Mark Zuckerberg" is NOT a PERSON. So you would need (at least):
[ "Mark Zuckerberg is the CEO of Facebook", {"entities": [[0, 15, "PERSON"], [30, 38, "ORG"]]} ]
So my NER training data has the following format:
TRAIN_DATA = {"annotations": [
    ["Google is a great company", {"entities": [[0, 6, "ORG"]]}],
    ["Mark Zuckerberg is the CEO of Facebook", {"entities": [[0, 15, "PERSON"], [30, 38, "ORG"]]}],
]}
spaCy 3.x recommends training via the command-line CLI, so what is discussed here is not the recommended way to train a model in spaCy 3.x. (I will stick with this approach for now anyway, as I do not like the CLI workflow and do not fully understand its configuration file config.cfg.) Be aware.
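For reference, the CLI workflow expects training data serialized as .spacy files; here is a minimal sketch (assuming spaCy 3.x) of converting the TRAIN_DATA format above into one:

import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
db = DocBin()
for text, annotations in TRAIN_DATA["annotations"]:
    doc = nlp.make_doc(text)
    spans = [doc.char_span(start, end, label=label)
             for start, end, label in annotations["entities"]]
    doc.ents = [s for s in spans if s is not None]  # drops misaligned spans
    db.add(doc)
db.to_disk("./train.spacy")  # pass this path to `spacy train` via --paths.train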
Upvotes: 2
Reputation: 15593
The "overwriting" you describe is called "catastrophic forgetting" and there's a post on the spaCy blog about it. There's no perfect workaround but we have a recent fix here.
Regarding your alignment warning...
"('I work in Google.',)" with entities "[(9, 15, 'ORG')]"
Your character offsets are off.
"I work in Google."[9:15]
# => " Googl"
Maybe they're off by a constant value and you can fix this by just adding one to everything, but you need to look at your data to figure that out.
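The warning message itself points at the right diagnostic. A small audit loop like this (a sketch assuming spaCy 3.x and the question's train list) flags every misaligned example:

import spacy
from spacy.training import offsets_to_biluo_tags

nlp = spacy.load("en_core_web_lg")
for text, annotations in train:
    tags = offsets_to_biluo_tags(nlp.make_doc(text), annotations["entities"])
    if "-" in tags:  # '-' marks tokens the offsets could not be aligned to
        print(text, annotations["entities"], tags)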
Upvotes: 1