Reputation: 467
I'm using the spaCy module to find named entities in input text. I am training the model to predict medical terms. I currently have access to 2 million medical notes, and I wrote a program that annotates them.
I cross-reference the medical notes against a predefined list of ~90 thousand terms, which is used for the annotation task. At the current pace, it takes about an hour and a half to annotate 10,000 notes. With the way annotation currently works, about 90% of the notes end up with no annotations (I'm working on getting a better list of cross-reference terms), so I take the ~1,000 annotated notes and train the model on those.
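In case it's useful, the annotation step is essentially dictionary matching. A stripped-down sketch of the idea, using spaCy v3's PhraseMatcher with a made-up label name and a tiny stand-in for my term list:

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")  # case-insensitive matching

# Stand-in for the ~90k-term cross-reference list
medical_terms = ["tachycardia", "atrial fibrillation", "dyspnea"]
matcher.add("MEDICAL_TERM", [nlp.make_doc(term) for term in medical_terms])

def annotate(text):
    """Turn one note into spaCy's (text, annotations) training format."""
    doc = nlp.make_doc(text)
    # Overlapping matches would need to be filtered before training
    entities = [(doc[start:end].start_char, doc[start:end].end_char, "MEDICAL_TERM")
                for _, start, end in matcher(doc)]
    return (text, {"entities": entities})

print(annotate("Patient has a history of atrial fibrillation."))
```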
I have checked, and the model sort of responds to annotated terms it has seen before (for example, the term tachycardia has appeared in the annotations, and the model will sometimes pick it up when it shows up in the text).
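For reference, this is roughly how I'm spot-checking it (the model path and label name are just placeholders for my setup):

```python
import spacy

nlp = spacy.load("medical_ner_model")  # placeholder path to my trained model
doc = nlp("ECG showed tachycardia; no other abnormalities noted.")
print([(ent.text, ent.label_) for ent in doc.ents])
# Sometimes prints [('tachycardia', 'MEDICAL_TERM')], sometimes nothing
```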
This background might not be directly relevant to my question, but I thought I would give a little context on where I currently stand.
I was wondering if anyone who has successfully trained a new entity type in spaCy could share, from their personal experience, how much training was necessary to get at least somewhat reliable entity recognition.
Thanks!
Upvotes: 1
Views: 1421
Reputation: 10139
I trained the Named Entity Recognizer for the Greek language from scratch because no data was available, so I'll try to summarize what I noticed in my case.
I trained the NER with the Prodigy annotation tool. From my personal experience, the answer to your question depends on how many labels you have, how distinct they are from each other, how much annotated data you provide, and how much time you spend fine-tuning.
For the Greek model, I predicted among 6 labels that were distinct enough from one another, I provided around 2,000 fully annotated sentences, and I spent a great amount of time fine-tuning.
Results: 70% F-measure, which is quite good given the complexity of the task.
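If you end up training with plain spaCy rather than Prodigy, a minimal training loop looks something like this. This is a sketch using spaCy v3's training API; the label name and the example sentence are made up, and you would substitute your ~1,000 annotated notes for TRAIN_DATA:

```python
import random
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
ner.add_label("MEDICAL_TERM")  # hypothetical label name

# Tiny stand-in for real training data:
# (text, {"entities": [(start_char, end_char, label), ...]})
TRAIN_DATA = [
    ("Patient presented with tachycardia and dyspnea.",
     {"entities": [(23, 34, "MEDICAL_TERM"), (39, 46, "MEDICAL_TERM")]}),
]

optimizer = nlp.initialize()
for epoch in range(30):
    random.shuffle(TRAIN_DATA)
    losses = {}
    for text, annotations in TRAIN_DATA:
        example = Example.from_dict(nlp.make_doc(text), annotations)
        nlp.update([example], sgd=optimizer, losses=losses)
    print(epoch, losses)

nlp.to_disk("medical_ner_model")  # hypothetical output path
```

The more distinct your labels are and the more fully annotated sentences you feed a loop like this, the faster the losses come down in my experience.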
Hope it helps!
Upvotes: 2