Piyush S. Wanare

Reputation: 4933

Training custom NER model

I have been training a custom spaCy NER model to find cities in text and tag them with custom entity labels.

Example:

    ('paragraph Designated Offices Party A New York Party B Delaware paragraph pricing source calculation Market Value shall generally accepted pricing source reasonably agreed parties paragraph Spot rate Spot Rate specified paragraph reasonably agreed parties',
    {'entities': [(37, 45, 'DesignatedBankLoc'), (54, 62, 'CounterpartyBankLoc')]})

I am looking for two entities here, DesignatedBankLoc and CounterpartyBankLoc. A single text can also contain multiple entities.

Currently I am training on 60 rows of data as follows:

import random

import spacy


def train_spacy(data, iterations):
    # copy so that shuffling does not reorder the caller's list
    train_data = list(data)
    nlp = spacy.blank('en')  # create a blank English pipeline
    # spaCy v2 API: create the built-in NER component and add it to the pipeline
    if 'ner' not in nlp.pipe_names:
        ner = nlp.create_pipe('ner')
        nlp.add_pipe(ner, last=True)
    else:
        ner = nlp.get_pipe('ner')

    # register every entity label that occurs in the annotations
    for _, annotations in train_data:
        for ent in annotations.get('entities'):
            ner.add_label(ent[2])

    # disable all other pipes so that only NER is trained
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
    with nlp.disable_pipes(*other_pipes):
        optimizer = nlp.begin_training()
        for itn in range(iterations):
            print("Starting iteration " + str(itn))
            random.shuffle(train_data)
            losses = {}
            for text, annotations in train_data:
                nlp.update(
                    [text],  # batch of texts
                    [annotations],  # batch of annotations
                    drop=0.5,  # dropout - make it harder to memorise data
                    sgd=optimizer,  # callable to update weights
                    losses=losses)
            print(losses)
    return nlp


prdnlp = train_spacy(TRAIN_DATA, 100)

My problem is:

The model predicts correctly when the input contains cities from the training data, whether the text follows the same pattern or a different one. It does not predict any entity when the input contains cities that never occur in the training data set, again regardless of the pattern.

Please explain why this is happening, and help me understand how the model is actually being trained.

Upvotes: 1

Views: 809

Answers (1)

Valentin Calomme

Reputation: 618

You have 60 rows of data and train for 100 iterations. In my experience, that means you are overfitting on the values of the entities (the specific city names) as opposed to their position in the text.

To check this, try to inject the city names at random places in a sentence and see what happens. If the algorithm tags them, you're likely overfitting.
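A minimal sketch of that check, assuming a trained pipeline like the `prdnlp` object from the question. The helper `inject_at_random_position` and the sample sentence are hypothetical and not part of spaCy. The helper only builds the probe text and records where the city landed, so it runs without spaCy installed.

```python
import random


def inject_at_random_position(sentence, city, rng=random):
    """Insert `city` between two randomly chosen tokens of `sentence`.

    Returns the new text and the (start, end) character offsets of the city.
    """
    tokens = sentence.split()
    # 0 means before the first token, len(tokens) means after the last one
    position = rng.randint(0, len(tokens))
    before = " ".join(tokens[:position])
    after = " ".join(tokens[position:])
    start = len(before) + 1 if before else 0
    text = " ".join(part for part in (before, city, after) if part)
    return text, (start, start + len(city))


if __name__ == "__main__":
    rng = random.Random(0)
    # hypothetical context that should NOT contain a bank location
    sentence = "pricing source reasonably agreed parties"
    for city in ["New York", "Delaware"]:  # cities seen during training
        text, (start, end) = inject_at_random_position(sentence, city, rng)
        print(text, "->", text[start:end])
        # With the trained model from the question you would then inspect:
        # doc = prdnlp(text)
        # print([(ent.text, ent.label_) for ent in doc.ents])
```

If the model labels the injected city no matter where it lands, it has most likely memorised the names themselves.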

There are two solutions:

  • Create more training data with more varied values for these entities
  • Test different numbers of iterations
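Both suggestions can be sketched in a few lines. The first helper swaps each annotated city for another value and recomputes the character offsets, so the model sees the same context around many different names. The second holds out examples containing chosen cities, so different iteration counts can be compared on cities the model has not seen. The helper names, the shortened sample text, and the replacement cities are illustrative assumptions; the data format is the `(text, {'entities': [...]})` tuple from the question.

```python
def substitute_entities(text, entities, replacements):
    """Replace each annotated span with a new value and recompute offsets.

    `entities` is a list of (start, end, label) tuples; `replacements` maps a
    label to its new surface string. Labels without a replacement keep their text.
    """
    pieces, new_entities, cursor, length = [], [], 0, 0
    for start, end, label in sorted(entities):
        value = replacements.get(label, text[start:end])
        pieces.append(text[cursor:start])  # unchanged text before the span
        length += start - cursor
        new_entities.append((length, length + len(value), label))
        pieces.append(value)
        length += len(value)
        cursor = end
    pieces.append(text[cursor:])  # unchanged text after the last span
    return "".join(pieces), {"entities": new_entities}


def split_by_entity_value(data, held_out_values):
    """Send every example that mentions a held-out value to the dev set."""
    held_out = set(held_out_values)
    train, dev = [], []
    for text, annotations in data:
        values = {text[s:e] for s, e, _ in annotations["entities"]}
        (dev if values & held_out else train).append((text, annotations))
    return train, dev


# Shortened, hypothetical version of the example row from the question
text = "Designated Offices Party A New York Party B Delaware"
entities = [(27, 35, "DesignatedBankLoc"), (44, 52, "CounterpartyBankLoc")]

# Illustrative replacement cities, not taken from the original data
augmented = [
    substitute_entities(
        text, entities, {"DesignatedBankLoc": a, "CounterpartyBankLoc": b}
    )
    for a, b in [("London", "Tokyo"), ("Hong Kong", "Zurich")]
]

data = augmented + [(text, {"entities": entities})]
train, dev = split_by_entity_value(data, {"Tokyo"})
```

One way to use this is to train on `train` with several iteration counts and keep the count whose model finds the most entities in `dev`.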

Upvotes: 2
