Bert Hanz

Reputation: 447

spaCy Named Entity Recognition Not Recognizing Product Entities Such as Foods

I'm using spaCy's Named Entity Recognition to figure out the food word in a sentence. This is the code that I have:

import spacy 
  
nlp = spacy.load('en_core_web_sm') 
  
sentence = "I like to eat pizza."
  
doc = nlp(sentence) 
  
for ent in doc.ents: 
    print(ent.text, ent.label_)

Why is it not printing "pizza"? According to spaCy's entity types, foods belong to the PRODUCT entity type, so shouldn't "pizza" be printed for ent.text and PRODUCT for ent.label_?

Upvotes: 0

Views: 1616

Answers (1)

Eurico Covas

Reputation: 11

I had the same issue and solved it by training spaCy on a few examples.

Grab a few sentences (even 3-4 is enough to start seeing results) and manually list the products that appear in each one. That gives you a dictionary mapping each text to its list of products. Then adapt this code:

from spacy.matcher import PhraseMatcher
from spacy.tokens import Span
from spacy.training import Example
from spacy.util import filter_spans

def getSpans(ner_model=None, products=None, nameForNewLabel='PRODUCTS', doc=None):
    products = products or []  # avoid a mutable default argument
    # create one pattern Doc per product string (tokenization only)
    patterns = [ner_model.make_doc(product) for product in products]
    # match the patterns against the doc; overlapping matches are resolved below
    matcher = PhraseMatcher(ner_model.vocab)
    matcher.add(nameForNewLabel, patterns)  # spaCy v3 signature: add(key, list_of_docs)
    matches = matcher(doc)
    # now create spans
    spans = []
    for match_id, start, end in matches:
        # create a new Span for each match and use the match_id (PRODUCTS) as the label
        span = Span(doc, start, end, label=match_id)
        # debug output: matched text, character offsets, label, and whether it is a known product
        print(span.text, span.start_char, span.end_char, span.label_, span.text in products)
        spans.append(span)

    # filter_spans removes duplicates and overlaps, which is required for named entities
    # (one token can only be part of one entity). When spans overlap, the (first) longest
    # span is preferred over shorter spans.
    filtered_spans = filter_spans(spans)
    doc.ents = filtered_spans
    # create the training example: an unannotated copy of the text as the prediction side,
    # and the annotated doc as the reference
    eg = Example(ner_model.make_doc(doc.text), doc)
    return eg

where

doc = ner_model.make_doc(text)

and

ner_model = spacy.blank('en')  # create blank Language class

Then train the model. Once it has been trained, e.g. for 200 epochs with the batch size equal to the number of examples, you will see that it works.
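
As a rough idea of what that training step can look like, here is a minimal sketch, assuming spaCy v3. It is not my production code: the toy sentences, the names TRAIN_DATA, build_examples and train_ner, and the fixed seed are all made up for illustration. To keep the block self-contained it builds the examples from character offsets with Example.from_dict instead of the PhraseMatcher helper above, and it passes all examples to nlp.update in one batch per epoch.

```python
import random

import spacy
from spacy.training import Example

# Hypothetical toy data: each text maps to the product strings it contains.
TRAIN_DATA = {
    "I like to eat pizza.": ["pizza"],
    "She ordered sushi and ramen.": ["sushi", "ramen"],
    "We had pasta for dinner.": ["pasta"],
    "He loves a good burger.": ["burger"],
}
LABEL = "PRODUCTS"


def build_examples(nlp, data, label):
    """Turn {text: [product, ...]} into spaCy Example objects."""
    examples = []
    for text, products in data.items():
        entities = []
        for product in products:
            start = text.find(product)  # first occurrence only (assumption)
            if start != -1:
                entities.append((start, start + len(product), label))
        # Prediction side: unannotated doc. Reference side: the entity offsets.
        examples.append(Example.from_dict(nlp.make_doc(text), {"entities": entities}))
    return examples


def train_ner(data, label=LABEL, epochs=200, seed=0):
    """Train a blank English pipeline with a single NER label."""
    random.seed(seed)
    spacy.util.fix_random_seed(seed)
    nlp = spacy.blank("en")
    ner = nlp.add_pipe("ner")
    ner.add_label(label)
    examples = build_examples(nlp, data, label)
    optimizer = nlp.initialize(lambda: examples)
    for _ in range(epochs):
        random.shuffle(examples)
        losses = {}
        # One update per epoch over all examples (batch size = number of examples).
        nlp.update(examples, sgd=optimizer, losses=losses)
    return nlp


nlp = train_ner(TRAIN_DATA)
doc = nlp("I like to eat pizza.")
print([(ent.text, ent.label_) for ent in doc.ents])
```

With so few sentences the model mostly memorizes the training texts, so add more varied examples before relying on it for unseen sentences.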

I cannot share my entire code as I am using it for products in a private equity AI company, but with the above I am sure you can get there.

Upvotes: 0
