Reputation: 4991
I am trying to evaluate a trained NER model created with the spaCy library. Normally for this kind of problem you can use the F1 score (the harmonic mean of precision and recall). I could not find an accuracy function for a trained NER model in the documentation.
I am not sure if this is correct, but I am trying the following approach (example), using f1_score from sklearn:
from sklearn.metrics import f1_score
import spacy
from spacy.gold import GoldParse
nlp = spacy.load("en") #load NER model
test_text = "my name is John" # text to test accuracy
doc_to_test = nlp(test_text) # transform the text to spacy doc format
# we create a golden doc where we know the tagged entity for the text to be tested
doc_gold_text= nlp.make_doc(test_text)
entity_offsets_of_gold_text = [(11, 15,"PERSON")]
gold = GoldParse(doc_gold_text, entities=entity_offsets_of_gold_text)
# bring the data in a format acceptable for sklearn f1 function
y_true = ["PERSON" if "PERSON" in x else 'O' for x in gold.ner]
y_predicted = [x.ent_type_ if x.ent_type_ !='' else 'O' for x in doc_to_test]
f1_score(y_true, y_predicted, average='macro')
> 1.0
Any thoughts or insights would be useful.
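One thing worth keeping in mind with the approach above is that it scores individual token tags, whereas NER is more commonly scored per entity, where a prediction counts only if both boundaries and label match exactly. The following is a minimal sketch of that idea in plain Python, not spaCy's own implementation. The function name entity_prf and the predicted spans are hypothetical and used only for illustration.

```python
def entity_prf(gold_spans, pred_spans):
    """Entity-level precision, recall and F1 over (start, end, label) spans."""
    gold, pred = set(gold_spans), set(pred_spans)
    # a true positive needs identical boundaries and an identical label
    true_pos = len(gold & pred)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# gold offsets for "I like London and Berlin."
gold = [(7, 13, "LOC"), (18, 24, "LOC")]
# hypothetical model output: right boundaries, one wrong label
pred = [(7, 13, "GPE"), (18, 24, "LOC")]

precision, recall, f1 = entity_prf(gold, pred)
print(precision, recall, f1)
```

With one of two entities matched exactly, precision, recall and F1 all come out at 0.5, even though every token boundary is correct.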
Upvotes: 28
Views: 26249
Reputation: 41
I searched for solutions on the internet but could not find one that worked. Now that I have figured out the root of the problem, I am sharing my code, which is similar to the code in the original question. I hope someone still finds it useful. It works with spaCy v3.3.
from spacy.scorer import Scorer
from spacy.training import Example

def evaluate(ner_model, samples):
    scorer = Scorer(ner_model)
    examples = []
    for sample in samples:
        # run the model to get the predicted Doc
        pred = ner_model(sample['text'])
        print(pred, sample['entities'])
        # pair the prediction with the gold entity offsets
        temp_ex = Example.from_dict(pred, {'entities': sample['entities']})
        examples.append(temp_ex)
    scores = scorer.score(examples)
    return scores
Note: samples should be a list of dicts, each with the text and its entity offsets in the spaCy v3 format, like the one below:
{'text': '#Causes - Quinsy - CA0K.1\nPeri Tonsillar Abscess is usually a complication of an untreated or partially treated acute tonsillitis. The infection, in these cases, spreads to the peritonsillar area (peritonsillitis). This region comprises loose connective tissue and is hence susceptible to formation of abscess.', 'entities': [(10, 16, 'Disease_E'), (26, 48, 'Disease_E'), (112, 129, 'Complication_E'), (177, 213, 'Anatomy_E'), (237, 260, 'Anatomy_E'), (302, 309, 'Disease_E')]}
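Since the scores depend on the character offsets being right, it can help to check them before evaluating. Below is a small sketch, assuming each entity is a (start, end, label) tuple with an exclusive end offset, as in the samples in this thread. The helper name check_offsets is hypothetical and not part of spaCy.

```python
def check_offsets(sample):
    """Return (label, covered text) per entity; raise on out-of-range offsets."""
    text = sample["text"]
    covered = []
    for start, end, label in sample["entities"]:
        if not (0 <= start < end <= len(text)):
            raise ValueError(f"offsets out of range: {(start, end, label)}")
        # slicing shows exactly which characters the annotation covers
        covered.append((label, text[start:end]))
    return covered


sample = {"text": "Who is Shaka Khan?", "entities": [(7, 17, "PERSON")]}
print(check_offsets(sample))
```

Printing the covered text makes off-by-one offsets or stray whitespace easy to spot by eye.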
Upvotes: 1
Reputation: 718
This is how I used to calculate accuracy for my custom spaCy NER model:
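def flat_accuracy(text, annotations):
    # assumes the third element of each annotation is the entity text;
    # if it is a label instead, compare against ent.label_ below
    actual_ents = [ents[2] for ents in annotations]
    prediction = nlp_ner(text)  # nlp_ner is the loaded custom NER pipeline
    pred_ents = [ent.text for ent in prediction.ents]
    # score 1 only if all entities match, in the same order
    return 1 if actual_ents == pred_ents else 0

predict_points = sum(flat_accuracy(test_text[0], test_text[1]) for test_text in examples)
output = (predict_points / len(examples)) * 100
# output --> 82%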
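The metric above is an all-or-nothing score per text: a text counts only if every entity matches. As a rough sketch, the same idea can be written so that it takes predictions as plain data, which makes it testable without a loaded model. The function name exact_match_accuracy and the sample lists are hypothetical.

```python
def exact_match_accuracy(gold_docs, pred_docs):
    """Percentage of texts whose predicted entity list equals the gold list."""
    if len(gold_docs) != len(pred_docs):
        raise ValueError("gold_docs and pred_docs must have the same length")
    if not gold_docs:
        return 0.0
    # a text scores a hit only when the full entity list matches, in order
    hits = sum(1 for g, p in zip(gold_docs, pred_docs) if list(g) == list(p))
    return 100.0 * hits / len(gold_docs)


# hypothetical gold and predicted entity texts for four short inputs
gold_docs = [["Shaka Khan"], ["London", "Berlin"], [], ["John"]]
pred_docs = [["Shaka Khan"], ["London"], [], ["John"]]

print(exact_match_accuracy(gold_docs, pred_docs))
```

Because a single missed entity zeroes the whole text, this number is usually stricter than entity-level precision and recall.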
Upvotes: 1
Reputation: 15593
Note that in spaCy v3 there is an evaluate command (python -m spacy evaluate) that you can run from the command line instead of writing custom code.
Upvotes: 3
Reputation: 525
Since I faced the same problem, I am posting the code for the example shown in the accepted answer, adapted for spaCy v3:
import spacy
from spacy.scorer import Scorer
from spacy.training.example import Example

examples = [
    ('Who is Shaka Khan?',
     [(7, 17, 'PERSON')]),
    ('I like London and Berlin.',
     [(7, 13, 'LOC'), (18, 24, 'LOC')])
]

def evaluate(ner_model, examples):
    scorer = Scorer()
    example = []
    for input_, annot in examples:
        pred = ner_model(input_)
        print(pred, annot)
        # gold offsets must be passed under the 'entities' key
        temp = Example.from_dict(pred, {'entities': annot})
        example.append(temp)
    scores = scorer.score(example)
    return scores

ner_model = spacy.load('en_core_web_sm')  # for spaCy's pretrained use 'en_core_web_sm'
results = evaluate(ner_model, examples)
print(results)
The breaking changes occurred because classes such as GoldParse were deprecated.
I believe the part of the accepted answer about metrics is still valid.
Upvotes: 6
Reputation: 4991
You can find different metrics including F-score, recall and precision in spaCy/scorer.py.
This example shows how you can use it:
import spacy
from spacy.gold import GoldParse
from spacy.scorer import Scorer

def evaluate(ner_model, examples):
    scorer = Scorer()
    for input_, annot in examples:
        doc_gold_text = ner_model.make_doc(input_)
        gold = GoldParse(doc_gold_text, entities=annot)
        pred_value = ner_model(input_)
        scorer.score(pred_value, gold)
    return scorer.scores

# example run
examples = [
    ('Who is Shaka Khan?',
     [(7, 17, 'PERSON')]),
    ('I like London and Berlin.',
     [(7, 13, 'LOC'), (18, 24, 'LOC')])
]

ner_model = spacy.load(ner_model_path)  # for spaCy's pretrained use 'en_core_web_sm'
results = evaluate(ner_model, examples)
scorer.scores returns multiple scores. When running the example, the result looks like this (note that the low scores occur because the examples label London and Berlin as 'LOC' while the model labels them as 'GPE'; you can see this by looking at ents_per_type):
{'uas': 0.0, 'las': 0.0, 'las_per_type': {'attr': {'p': 0.0, 'r': 0.0, 'f': 0.0}, 'root': {'p': 0.0, 'r': 0.0, 'f': 0.0}, 'compound': {'p': 0.0, 'r': 0.0, 'f': 0.0}, 'nsubj': {'p': 0.0, 'r': 0.0, 'f': 0.0}, 'dobj': {'p': 0.0, 'r': 0.0, 'f': 0.0}, 'cc': {'p': 0.0, 'r': 0.0, 'f': 0.0}, 'conj': {'p': 0.0, 'r': 0.0, 'f': 0.0}}, 'ents_p': 33.33333333333333, 'ents_r': 33.33333333333333, 'ents_f': 33.33333333333333, 'ents_per_type': {'PERSON': {'p': 100.0, 'r': 100.0, 'f': 100.0}, 'LOC': {'p': 0.0, 'r': 0.0, 'f': 0.0}, 'GPE': {'p': 0.0, 'r': 0.0, 'f': 0.0}}, 'tags_acc': 0.0, 'token_acc': 100.0, 'textcat_score': 0.0, 'textcats_per_cat': {}}
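To see quickly which labels are dragging the overall score down, the ents_per_type entry can be inspected directly. The sketch below copies only the entity-related values from the output above into a literal dict. The helper name zero_f_labels is hypothetical.

```python
# entity-related part of the result shown above, copied as a literal
results = {
    "ents_p": 33.33333333333333,
    "ents_r": 33.33333333333333,
    "ents_f": 33.33333333333333,
    "ents_per_type": {
        "PERSON": {"p": 100.0, "r": 100.0, "f": 100.0},
        "LOC": {"p": 0.0, "r": 0.0, "f": 0.0},
        "GPE": {"p": 0.0, "r": 0.0, "f": 0.0},
    },
}


def zero_f_labels(scores):
    """Return the entity labels whose F-score is zero."""
    return [label for label, m in scores["ents_per_type"].items() if m["f"] == 0.0]


print(zero_f_labels(results))
```

Here both LOC (present only in the gold data) and GPE (present only in the predictions) score zero, which points to a label mismatch instead of a boundary problem.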
The example is taken from a spaCy example on github (link does not work anymore). It was last tested with spacy 2.2.4.
Upvotes: 38