Reputation: 91
My spacy version is 2.3.7. I have an existing trained custom NER model with NER and Entity Ruler pipes. I want to update and retrain this existing pipeline.
The code to create the entity ruler pipe was as follows:
    from spacy.pipeline import EntityRuler
    ruler = EntityRuler(nlp)
    for i in patt_dict:
        # add_patterns expects a list of pattern dicts
        ruler.add_patterns(i)
    nlp.add_pipe(ruler, name="entity_ruler")
Where patt_dict is the original patterns dictionary I had made.
Now that training is finished, I have more input data and want to train the model further with it.
How can I modify the above code to add more pattern dictionaries to the entity ruler when I later load the spaCy model and retrain it with more input data?
Upvotes: 2
Views: 1555
Reputation: 121
I'd also recommend polm23's suggestion to retrain fully in this situation.
Here is why: we are asking the model to produce inferences based on weights learned by matching input data to labels/classes/whatever over and over. These weights are adjusted via backprop to reduce the error with respect to the labels/classes/whatever. Training continues until the loss levels off as close to 0 as the data allows, or until you stop it via hyperparameters (epochs).
However, if you train on only the new data, you optimize only for that specific data. The model will generalize poorly, but really only because it is learning exactly what you asked it to learn and nothing else. Add in that retraining fully is usually not the end of the world, and it just kinda makes sense as a best practice.
(This is my imperfect understanding of the catastrophic forgetting issue; happy to learn more if others have deeper knowledge.)
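The "retrain fully" idea above could be sketched roughly as follows for spaCy 2.3.x. This is only an illustration under assumptions, not a tested recipe: it assumes the training data is in the v2 format of (text, {"entities": [(start, end, label)]}) tuples, that a blank English pipeline is appropriate, and that when the same text appears in both sets the newer annotation should win. The function names, n_iter, batch size, and dropout value are all hypothetical choices.

```python
import random


def merge_training_data(old_data, new_data):
    """Combine old and new examples; if a text appears in both, keep the new annotation.

    Assumption: each item is a (text, {"entities": [(start, end, label), ...]}) tuple.
    """
    merged = {text: annotations for text, annotations in old_data}
    merged.update({text: annotations for text, annotations in new_data})
    return list(merged.items())


def train_ner_from_scratch(train_data, n_iter=20, seed=0):
    """Train a fresh NER component on ALL the data (spaCy v2-style API, assumed 2.3.x)."""
    # Imported here so the data helper above can be used without spaCy installed.
    import spacy
    from spacy.util import minibatch

    random.seed(seed)
    train_data = list(train_data)  # copy so the caller's list is not shuffled

    nlp = spacy.blank("en")  # assumption: English; swap in your language code
    ner = nlp.create_pipe("ner")
    nlp.add_pipe(ner)
    for _, annotations in train_data:
        for _start, _end, label in annotations["entities"]:
            ner.add_label(label)

    optimizer = nlp.begin_training()
    for _ in range(n_iter):
        random.shuffle(train_data)
        losses = {}
        for batch in minibatch(train_data, size=8):
            texts, annotations = zip(*batch)
            nlp.update(texts, annotations, sgd=optimizer, drop=0.2, losses=losses)
    return nlp
```

The point of the sketch is only that the old and new examples go into one pool before training starts, so the model never sees the new data in isolation.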
Upvotes: 2
Reputation: 15623
It is generally better to retrain from scratch. If you train only on new data you are likely to run into "catastrophic forgetting", where the model forgets anything not in the new data.
This is covered in detail in this spaCy blog post. As of v3 the approach outlined there is available in spaCy, but it's still experimental and needs some work. In any case, it's still kind of a workaround, and the best thing is to train from scratch with all data.
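Retraining from scratch applies to the statistical NER component. The entity ruler from the question is rule-based, so its patterns can be extended on the loaded pipeline without any training. Here is a minimal sketch, assuming the ruler was saved under the name "entity_ruler" as in the question, and that new_patterns is a list of pattern dicts such as {"label": "ORG", "pattern": "Acme Corp"}. The function names and directory arguments are hypothetical.

```python
def extend_entity_ruler(nlp, new_patterns, ruler_name="entity_ruler"):
    """Add patterns to the EntityRuler that already exists in a loaded pipeline."""
    ruler = nlp.get_pipe(ruler_name)   # fetch the existing component by name
    ruler.add_patterns(new_patterns)   # expects a list of pattern dicts
    return ruler


def add_patterns_to_saved_model(model_dir, new_patterns, output_dir):
    """Load a saved pipeline, extend its ruler, and save the result.

    model_dir and output_dir are placeholder paths supplied by the caller.
    """
    import spacy  # imported here so extend_entity_ruler has no hard dependency

    nlp = spacy.load(model_dir)
    extend_entity_ruler(nlp, new_patterns)
    nlp.to_disk(output_dir)
    return nlp
```

Because the ruler is fetched from the loaded pipeline instead of being created again, there is no second nlp.add_pipe call, which would otherwise fail on a duplicate component name.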
Upvotes: 3