Reputation: 151
Use case
I am creating a program that annotates sentences in a paragraph containing specific terms, for quick contract analysis, but I am noticing that the annotation is not highlighting the correct words. Please see the training data in the code below; the testing data is also shown below.
I would expect my code to identify
"MrAsells a product that he warrants for at least one year" when it finds the term "Warranty", and
"The minimum payment terms acceptable to our firm are Net 90 days" when it sees "Payment Terms", based on the training data.
However, the output of the code is shown below; the "Marketing" label does not align at all with "Payment Terms":
WORK_OF_ART -- MrX
Marketing -- MrAsells a product that he warrants for at least one year and is hopeful that he receives the payment for the product within 70 days.
Marketing -- MrB is not allowed to share any logo that he might use during the project phase with other clients as a promotional item.
# Testing data, imported using docx2txt
"MrB expects MrX to take responsibility for owning client data to highest standard. CompanyA is an affiliate of CompanyB. MrAsells a product that he warrants for at least one year and is hopeful that he receives the payment for the product within 70 days. MrB is not allowed to share any logo that he might use during the project phase with other clients as a promotional item"
#Code
import spacy
import random
from spacy.training import Example
import docx2txt
from spacy import displacy
import pandas as pd
import docx

#nlp = spacy.blank('en')
nlp = spacy.load('en_core_web_sm')
ner = nlp.get_pipe("ner")

if 'ner' not in nlp.pipe_names:
    ner_pipe = nlp.create_pipe('ner')
    nlp.add_pipe(ner_pipe, last=True)
else:
    ner_pipe = nlp.get_pipe('ner')

TRAIN_DATA = [
    ("The minimum payment terms acceptable to our firm are Net 90 days.", {"entities": [(0, 62, "Payment Terms")]}),
    ("We do not allow anyone to share our logo for marketing purpose.", {"entities": [(0, 63, "Marketing")]}),
    ("We expect that the firm will honor our warranty requirement of atleast one year.", {"entities": [(39, 48, "Warranty")]}),
]

for _, annotations in TRAIN_DATA:
    for entity in annotations['entities']:
        ner.add_label(entity[2])

other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
with nlp.disable_pipes(*other_pipes):  # only train NER
    optimizer = nlp.create_optimizer()
    for iteration in range(200):
        random.shuffle(TRAIN_DATA)
        for text, annotations in TRAIN_DATA:
            doc = nlp.make_doc(text)
            example = Example.from_dict(doc, annotations)
            nlp.update([example], drop=0.3)

# test the trained model; add some dummy sentences with many NERs
test_text = docx2txt.process('C:/users/Siddk/Testing.docx')
doc = nlp(test_text)
for ent in doc.ents:
    print(ent.label_, " -- ", ent.text)
Upvotes: 2
Views: 391
Reputation: 15623
In your training data you have three examples. I assume that is just an abridged example for Stack Overflow, but just in case: you cannot train a model from three examples; you need at least hundreds.
More generally, you cannot use NER to tag whole sentences, especially spaCy's NER. From the docs:
The transition-based algorithm also assumes that the most decisive information about your entities will be close to their initial tokens. If your entities are long and characterized by tokens in their middle, the component will likely not be a good fit for your task.
Of your three examples, in two cases you have labeled whole sentences. The model will not be able to learn this.
There are a couple of things you can do instead. One is to use a text classifier on sentences. Another is to look at the SpanCategorizer, which will be released soon as an experimental feature.
I would suggest the classification approach, though: the beginnings and ends of spans aren't really important in your examples; it seems like you just want to classify sentences. A rough sketch of that approach is below.
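Here is a minimal sketch of the sentence-classification idea, assuming spaCy 3.x. The label names and the tiny training set are taken from your question purely to show the mechanics (a usable model still needs hundreds of examples), and the test sentence is a stand-in for your docx2txt output:

import random
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
# "textcat" assumes exactly one label per sentence; use "textcat_multilabel" if a sentence can carry several
textcat = nlp.add_pipe("textcat")
for label in ("Payment Terms", "Marketing", "Warranty"):
    textcat.add_label(label)

# Each sentence gets a score per category instead of character-offset entity spans.
TRAIN_DATA = [
    ("The minimum payment terms acceptable to our firm are Net 90 days.",
     {"cats": {"Payment Terms": 1.0, "Marketing": 0.0, "Warranty": 0.0}}),
    ("We do not allow anyone to share our logo for marketing purposes.",
     {"cats": {"Payment Terms": 0.0, "Marketing": 1.0, "Warranty": 0.0}}),
    ("We expect that the firm will honor our warranty requirement of at least one year.",
     {"cats": {"Payment Terms": 0.0, "Marketing": 0.0, "Warranty": 1.0}}),
]

examples = [Example.from_dict(nlp.make_doc(text), annots) for text, annots in TRAIN_DATA]
optimizer = nlp.initialize(lambda: examples)
for _ in range(20):
    random.shuffle(examples)
    nlp.update(examples, sgd=optimizer)

# Split the contract text into sentences and classify each one separately.
splitter = spacy.blank("en")
splitter.add_pipe("sentencizer")
test_text = "MrA sells a product that he warrants for at least one year."  # stand-in for docx2txt.process(...)
for sent in splitter(test_text).sents:
    scores = nlp(sent.text).cats
    print(max(scores, key=scores.get), "--", sent.text)

With this setup you would still extract the contract text with docx2txt as before, but each sentence is fed through the classifier and gets one label; the NER component is not involved at all.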
Upvotes: 1