El_Patrón

Reputation: 533

Training own model and adding new entities with spacy

I have been trying to train a model with the same method as #887 uses, just for a test case. I have a question: what would be the best format for a training corpus to import into spaCy? I have a text file with a list of entities that need new entity tags for training. Let me explain my case. I follow the update-training script like this:

import random

import spacy
from spacy.gold import GoldParse
from spacy.pipeline import EntityRecognizer

# Load the model without its default entity recognizer and parser
nlp = spacy.load('en_core_web_md', entity=False, parser=False)

# A fresh entity recognizer that knows the new entity type
ner = EntityRecognizer(nlp.vocab, entity_types=['FINANCE'])

for itn in range(5):
    random.shuffle(train_data)
    for raw_text, entity_offsets in train_data:
        doc = nlp.make_doc(raw_text)
        gold = GoldParse(doc, entities=entity_offsets)

        nlp.tagger(doc)  # POS tags are used as features by the NER
        ner.update(doc, gold)
ner.model.end_training()

I add my training data as entity_offsets:

train_data = [
    ('Monetary contracts are financial instruments between parties', [(23, 44, 'FINANCE')])
]

This is working fine for the one example and the new entity tag. Obviously I want to be able to add more than one example. The idea is to create a text file with tagged sentences. The question is what format spaCy needs for training data: should I stick with entity_offsets as in the examples (this would be a very tedious task for thousands of sentences), or is there another way to prepare the file, like:

financial instruments   FINANCE
contracts   FINANCE
Product OBJ
of O
Microsoft ORG
etc ...

And how can I pass the corpus to spaCy using the mentioned method? Do I have to use the newly created model, or can I add the new entities to the old model, and how can this be achieved?

UPDATE: I managed to import a file with training data that is recognized by the training method described above. The list looks like this:

Financial instruments can be real or virtual documents, 0 21 FINANCE
The number of units of the financial instrument, 27 47 FINANCE
or the number of derivative contracts in the transaction, 17 37 BANKING
Date and time when the transaction was executed, 23 34 ORDER
...
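For reference, one way to turn lines in that format into the (text, [(start, end, label)]) tuples the training loop expects is a small parser. This is a sketch that assumes each line ends with ", start end LABEL" after the final comma (the sentence itself may contain commas, so it splits on the last one); the file name and helper are illustrative, not part of spaCy:

```python
lines = [
    "Financial instruments can be real or virtual documents, 0 21 FINANCE",
    "or the number of derivative contracts in the transaction, 17 37 BANKING",
]

def parse_line(line):
    """Split on the LAST comma: text on the left, 'start end LABEL' on the right."""
    text, annotation = line.rsplit(',', 1)
    start, end, label = annotation.split()
    return (text, [(int(start), int(end), label)])

train_data = [parse_line(line.strip()) for line in lines if line.strip()]
# train_data[0] → ('Financial instruments can be real or virtual documents',
#                  [(0, 21, 'FINANCE')])
```

To read a real file, replace the `lines` list with something like `open('train.txt')`.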

But the training is not performing well; I suppose this is due to the small amount of training data. I get all entries in the test corpus tagged as FINANCE, or all tagged as BANKING. How big does my training data need to be to get better performance?

I guess I will have to annotate a bigger corpus for my training data. Can this be done in a different way?

What algorithm is behind spaCy's named entity recognizer?

Thanks for any help.

My Environment

spaCy version: 1.7.3
Platform: Windows-7-6.1.7601-SP1
Python version: 3.6.0
Installed models: en, en_core_web_md

Upvotes: 3

Views: 6986

Answers (1)

Nishank Mahore

Reputation: 514

To provide training examples to the entity recognizer, you'll first need to create an instance of the GoldParse class. You can specify your annotations in a stand-off format (character offsets) or as token tags.

import spacy
import random
from spacy.gold import GoldParse
from spacy.pipeline import EntityRecognizer

train_data = [
    ('Who is Chaka Khan?', [(7, 17, 'PERSON')]),
    ('I like London and Berlin.', [(7, 13, 'LOC'), (18, 24, 'LOC')])
]

nlp = spacy.load('en', entity=False, parser=False)
ner = EntityRecognizer(nlp.vocab, entity_types=['PERSON', 'LOC'])

for itn in range(5):
    random.shuffle(train_data)
    for raw_text, entity_offsets in train_data:
        doc = nlp.make_doc(raw_text)
        gold = GoldParse(doc, entities=entity_offsets)

        nlp.tagger(doc)
        ner.update(doc, gold)
ner.model.end_training()

Or, to simplify this, you can use token tags in the BILUO scheme instead of character offsets:

from spacy.tokens import Doc

doc = Doc(nlp.vocab, [u'rats', u'make', u'good', u'pets'])
gold = GoldParse(doc, [u'U-ANIMAL', u'O', u'O', u'O'])
ner = EntityRecognizer(nlp.vocab, entity_types=['ANIMAL'])
ner.update(doc, gold)
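The tags above follow the BILUO scheme: B(egin), I(n), and L(ast) for multi-token entities, U(nit) for single-token entities, and O for tokens outside any entity. As a rough illustration of how character offsets map to these token tags, here is a hand-rolled sketch that assumes simple whitespace tokenization (spaCy's own tokenizer is more subtle, so treat this as an explanation of the scheme, not a replacement for it):

```python
def offsets_to_biluo(text, entities):
    """Map (start, end, label) character offsets onto per-token BILUO tags.
    Assumes whitespace tokenization."""
    tags = []
    pos = 0
    for token in text.split():
        start = text.index(token, pos)
        end = start + len(token)
        pos = end
        tag = 'O'
        for ent_start, ent_end, label in entities:
            if start == ent_start and end == ent_end:
                tag = 'U-' + label   # entity is exactly this one token
            elif start == ent_start and end < ent_end:
                tag = 'B-' + label   # first token of the entity
            elif start > ent_start and end == ent_end:
                tag = 'L-' + label   # last token of the entity
            elif start > ent_start and end < ent_end:
                tag = 'I-' + label   # inside the entity
        tags.append(tag)
    return tags

offsets_to_biluo('rats make good pets', [(0, 4, 'ANIMAL')])
# → ['U-ANIMAL', 'O', 'O', 'O']
```

Applied to the earlier example, 'Who is Chaka Khan?' with offsets (7, 17, 'PERSON') yields B-PERSON for 'Chaka' and L-PERSON for 'Khan'.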

Upvotes: 5
