GeorgeOfTheRF

Reputation: 8854

How to get probability of prediction per entity from Spacy NER model?

I used this official example code to train a NER model from scratch using my own training samples.

When I run this model on new text, I want to get the prediction probability for each entity.

    # test the saved model
    print("Loading from", output_dir)
    nlp2 = spacy.load(output_dir)
    for text, _ in TRAIN_DATA:
        doc = nlp2(text)
        print("Entities", [(ent.text, ent.label_) for ent in doc.ents])
        print("Tokens", [(t.text, t.ent_type_, t.ent_iob) for t in doc])

I am unable to find a method in Spacy that returns the prediction probability for each entity.

How do I get this probability from Spacy? I need it so that I can apply a cutoff.

Upvotes: 7

Views: 6455

Answers (3)

Kyle Wayne Parker

Reputation: 11

NOTE: In spaCy v3, beam_parse can no longer be accessed via nlp.entity.beam_parse() (as in the older answers below); it must instead be reached through the .get_pipe() function:

import spacy
import pandas as pd
from collections import defaultdict

# ensure you specify the path to where you are storing your nlp model!
nlp = spacy.load(r'C:\path\to\en_core_web_sm\en_core_web_sm-3.7.1')

# ensure you specify your dataset
df = pd.read_csv('yourdataset.csv')

df['Entities'] = [[] for _ in range(len(df))]  # initialize empty lists

for index, row in df.iterrows():
    try:
        input_phrase = str(row['INPUT COLUMN'])
        response = nlp(input_phrase)
        beams = nlp.get_pipe("ner").beam_parse([response], beam_width=16, beam_density=0.0001)

        # sum the scores of every beam parse in which a given (start, end, label) span appears
        entity_scores = defaultdict(float)
        for beam in beams:
            for score, ents in nlp.get_pipe("ner").moves.get_beam_parses(beam):
                for start, end, label in ents:
                    entity_scores[(start, end, label)] += score

        entries = []
        for (start, end, label), score in entity_scores.items():
            # use ; as delimiter for unpacking later
            entries.append(f"Label: {label}, Text: {response[start:end]}, Score: {score:.4f};")

        print(entries)
        df.at[index, 'Entities'] = entries
    except Exception:
        df.at[index, 'Entities'] = 'None'

df.to_csv('youroutputfile.csv', index=False)

This writes a list to your 'Entities' column that can be unpacked later by splitting on the ; delimiter. Each element is formatted as 'Label: THING, Text: whatever the identified object was, Score: 0.1234;'.
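For completeness, a minimal sketch of that later unpacking step (assuming the output file and column name from the code above; the handling of pandas' stringified lists here is a rough illustration, not a robust parser):

import pandas as pd

df = pd.read_csv('youroutputfile.csv')

# Each cell holds a stringified list of "Label: ..., Text: ..., Score: ...;"
# entries, so split on the ; delimiter and strip leftover list punctuation.
for raw in df['Entities'].dropna():
    if raw == 'None':  # sentinel written for rows that raised an exception
        continue
    for entry in str(raw).split(';'):
        entry = entry.strip(" []'\",")
        if entry:
            print(entry)  # e.g. Label: ORG, Text: Microsoft, Score: 0.9990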

Upvotes: 0

mbrunecky

Reputation: 206

Sorry, I do not have any better answer - I can only confirm that the 'beam' solution does provide some 'probabilities', though in my case I am getting far too many entities with prob = 1.0, even in cases where I can only shake my head and blame it on too little training data.

I find it quite strange that Spacy reports an 'entity' without attaching any confidence to it. I would assume there is some threshold that decides WHEN Spacy reports an entity and when it does NOT (perhaps I missed it). In my case, I see a confidence of 0.6 reported as 'this is an entity', while an entity with confidence 0.001 is NOT reported.

In my use case, the confidence is essential. For a given text, Spacy (and, for example, Google ML) report multiple instances of 'MY_ENTITY'. My code has to decide which ones are to be 'trusted' and which ones are false positives. I have yet to see whether the 'probability' returned by the above code has any practical value.
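To see this effect concretely, one can compare the spans Spacy actually reports via doc.ents against the beam scores. A minimal sketch, assuming doc and entity_scores come from the beam code shown elsewhere on this page:

# Compare what Spacy actually reports (greedy decoding) with the
# summed beam scores for every candidate span.
reported = {(ent.start, ent.end, ent.label_) for ent in doc.ents}
for (start, end, label), prob in sorted(entity_scores.items(), key=lambda kv: -kv[1]):
    status = 'reported' if (start, end, label) in reported else 'not reported'
    print(f"{doc[start:end].text!r} {label} {prob:.4f} {status}")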

Upvotes: 3

DBaker

Reputation: 2139

Getting per-entity prediction probabilities from a Spacy NER model is not trivial. Here is a solution, adapted from here:


import spacy
from collections import defaultdict

texts = ['John works at Microsoft.']

# Number of alternate analyses to consider. More is slower, and not
# necessarily better -- you need to experiment on your problem.
beam_width = 16
# This clips solutions at each step. We multiply the score of the
# top-ranked action by this value, and use the result as a threshold.
# This prevents the parser from exploring options that look very
# unlikely, saving a bit of efficiency. Accuracy may also improve,
# because we've trained on a greedy objective.
beam_density = 0.0001
nlp = spacy.load('en_core_web_md')

docs = list(nlp.pipe(texts, disable=['ner']))
beams = nlp.entity.beam_parse(docs, beam_width=beam_width, beam_density=beam_density)

for doc, beam in zip(docs, beams):
    # sum the scores of every beam parse in which a given span/label appears
    entity_scores = defaultdict(float)
    for score, ents in nlp.entity.moves.get_beam_parses(beam):
        for start, end, label in ents:
            entity_scores[(start, end, label)] += score

    entries = [{'start': start, 'end': end, 'label': label, 'prob': prob}
               for (start, end, label), prob in entity_scores.items()]
    for entry in sorted(entries, key=lambda x: x['start']):
        print(entry)

### Output: ####

{'start': 0, 'end': 1, 'label': 'PERSON', 'prob': 0.4054479906820232}
{'start': 0, 'end': 1, 'label': 'ORG', 'prob': 0.01002015005487447}
{'start': 0, 'end': 1, 'label': 'PRODUCT', 'prob': 0.0008592912552754791}
{'start': 0, 'end': 1, 'label': 'WORK_OF_ART', 'prob': 0.0007666755792166002}
{'start': 0, 'end': 1, 'label': 'NORP', 'prob': 0.00034931990870877333}
{'start': 0, 'end': 1, 'label': 'TIME', 'prob': 0.0002786051849320804}
{'start': 3, 'end': 4, 'label': 'ORG', 'prob': 0.9990115861687987}
{'start': 3, 'end': 4, 'label': 'PRODUCT', 'prob': 0.0003378157477046507}
{'start': 3, 'end': 4, 'label': 'FAC', 'prob': 8.249734411749544e-05}
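
Since the question asks about applying a cutoff, here is a minimal sketch of filtering these scores by a threshold, assuming doc and entity_scores from the loop above (the 0.5 value is an arbitrary example, not a recommendation):

# Keep only the spans whose summed beam score clears the cutoff.
cutoff = 0.5  # arbitrary example value; tune on held-out data
for (start, end, label), prob in entity_scores.items():
    if prob >= cutoff:
        print(doc[start:end].text, label, round(prob, 4))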

Upvotes: 5
