Phil Robinson

Reputation: 467

What is the amount of training data needed for additional Named Entity Recognition with spaCy?

I'm using the spaCy module to find named entities in input text. I am training the model to predict medical terms. I currently have access to 2 million medical notes, and I wrote a program that annotates them.

I cross-reference the medical notes against a pre-defined list of ~90 thousand terms, which is used for the annotation task. At the current pace, it takes about an hour and a half to annotate 10,000 notes. The way annotation currently works, about 90% of the notes end up with no annotations (I'm working on getting a better list of cross-reference terms), so I take the ~1,000 annotated notes and train the model on those.
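For illustration, this kind of cross-reference annotation can be sketched with spaCy's PhraseMatcher (a simplified sketch, not my exact program; the term list and note below are placeholders, and a spaCy v3-style API is assumed):

    import spacy
    from spacy.matcher import PhraseMatcher

    # Placeholder inputs: a tiny term list and one note; in reality the
    # list has ~90k terms and there are millions of notes.
    medical_terms = ["tachycardia", "bradycardia", "myocardial infarction"]
    notes = ["Patient presented with tachycardia and chest pain."]

    nlp = spacy.blank("en")  # a blank pipeline is enough for matching
    matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
    matcher.add("MEDICAL_TERM", [nlp.make_doc(term) for term in medical_terms])

    train_data = []
    for doc in nlp.pipe(notes):
        entities = [
            (doc[start:end].start_char, doc[start:end].end_char, "MEDICAL_TERM")
            for match_id, start, end in matcher(doc)
        ]
        if entities:  # keep only notes that received at least one annotation
            train_data.append((doc.text, {"entities": entities}))

    print(train_data)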

I have checked, and the model does somewhat respond to annotated terms it has seen before (for example, the term tachycardia appears in the annotations, and the model will sometimes pick it up when it shows up in new text).
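A spot-check like this can be as simple as the following (a minimal sketch; the model path is just a placeholder):

    import spacy

    # Load the trained pipeline; the path here is a placeholder.
    nlp = spacy.load("./medical_ner_model")

    doc = nlp("The patient was noted to have tachycardia after exertion.")
    for ent in doc.ents:
        print(ent.text, ent.label_)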

This background might not be directly relevant to my question, but I thought I would give a small bit of context on where I currently stand.

I was wondering if anyone who has successfully trained a new entity type in spaCy could share their experience of how much training data was needed to get at least somewhat reliable entity recognition.

Thanks!

Upvotes: 1

Views: 1421

Answers (1)

gdaras

Reputation: 10139

I trained a Named Entity Recognizer for Greek from scratch because no data was available, so I will try to summarize the things I noticed in my case.

I trained the NER with the Prodigy annotation tool. In my experience, the answer to your question depends on the following things:

  • The number of labels you want your recognizer to predict. It makes sense that as the number of labels (possible outputs) increases, it becomes harder for your neural network to distinguish between them, so the amount of data you need increases.
  • How different the labels are. For example, the GPE and LOC tags are quite close and often used in the same context, so the neural network confused them a lot at the beginning. It is advisable to provide more data for labels that are close to each other.
  • The way of training. There are essentially two possibilities here (a sketch after this list illustrates both formats):
    • Fully annotated sentences. This means that you tell your neural network that there are no missing tags in your annotations.
    • Partially annotated sentences. This means that you tell your neural network that your annotations are correct, but some tags may be missing. This makes it harder for the network to rely on your data, so more data needs to be provided.
  • Hyper-parameters. It is really important to fine-tune your network in order to get the maximum out of your dataset.
  • The quality of the dataset. If the dataset is representative of the things you are going to ask your network to predict, less data is required. However, if you are building a more general model (one that answers correctly in different contexts), more data is needed.
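To make the fully vs. partially annotated distinction concrete, here is a minimal sketch assuming a spaCy v3-style API (the text, offsets and label are made up). Character-offset entities treat unlisted tokens as "outside" (fully annotated), while "-" in BILUO tags marks a token's entity status as unknown (partially annotated). The update loop also shows the dropout hyper-parameter mentioned above.

    import spacy
    from spacy.training import Example
    from spacy.util import minibatch

    # Made-up example sentence; only "tachycardia" is annotated.
    TEXT = "Patient presented with tachycardia and chest pain."

    nlp = spacy.blank("en")
    nlp.add_pipe("ner")
    doc = nlp.make_doc(TEXT)

    # Fully annotated: tokens not covered by an entity are treated as "O"
    # (explicitly not an entity), the default for character offsets.
    full_example = Example.from_dict(
        doc, {"entities": [(23, 34, "MEDICAL_TERM")]}
    )

    # Partially annotated: "-" marks a token's entity status as unknown,
    # so the model is not told the other tokens are non-entities.
    partial_example = Example.from_dict(
        doc, {"entities": ["-", "-", "-", "U-MEDICAL_TERM", "-", "-", "-", "-"]}
    )

    # Minimal update loop; drop (dropout) is one of the hyper-parameters
    # worth tuning, alongside batch size and the number of iterations.
    examples = [full_example]  # or partially annotated examples, per your data
    optimizer = nlp.initialize(lambda: examples)
    for epoch in range(10):
        losses = {}
        for batch in minibatch(examples, size=8):
            nlp.update(batch, drop=0.35, losses=losses, sgd=optimizer)
        print(epoch, losses["ner"])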

For the Greek model, I predicted among 6 labels that were distinct enough, I provided around 2,000 fully annotated sentences, and I spent a great amount of time fine-tuning.

Results: 70% F-measure, which is quite good for the complexity of the task.

Hope it helps!

Upvotes: 2
