Kay

Reputation: 19670

How to find proper nouns using spaCy NLP

I'm using spaCy to build a keyword extractor. The keyword I'm looking for is OpTic Gaming in the following text.

"The company was also one of OpTic Gaming's main sponsors during the legendary organization's run to their first Call of Duty Championship back in 2017"

How can I extract OpTic Gaming from this text? If I use noun_chunks I get "OpTic Gaming's main sponsors", and if I iterate over the tokens I get ["OpTic", "Gaming", "'s"].

import spacy

nlp = spacy.load("en_core_web_sm")

doc = nlp("The company was also one of OpTic Gaming's main sponsors during the legendary organization's run to their first Call of Duty Championship back in 2017")

for chunk in doc.noun_chunks:
    print(chunk.text, chunk.root.text, chunk.root.dep_,
            chunk.root.head.text)

The company company nsubj was
OpTic Gaming's main sponsors sponsors pobj of
their first Call Call pobj to
Duty Championship Championship pobj of

Upvotes: 3

Views: 6129

Answers (2)

Pankaj

Reputation: 1

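import spacy

nlp = spacy.load("en_core_web_sm")
text = "New Delhi is a Capital of India"

doc = nlp(text)

# Collect multi-word named entities (people, organizations, places), keyed by label.
full_entities = {}
for ent in doc.ents:
    if ent.label_ in ["PERSON", "ORG", "GPE"] and " " in ent.text:
        if ent.label_ not in full_entities:
            full_entities[ent.label_] = []
        full_entities[ent.label_].append(ent.text)

# Fallback when no multi-word entity was found: pair up neighbouring proper nouns.
# Note: neighbours are taken from the filtered PROPN list, so two proper nouns
# separated by other words in the sentence can still be paired, and every
# pair is filed under "PERSON" regardless of what it actually names.
if not full_entities:
    proper_nouns = [token.text for token in doc if token.pos_ == "PROPN"]
    for i, token in enumerate(proper_nouns[:-1]):
        if proper_nouns[i+1].istitle() and not token.endswith("."):
            if "PERSON" not in full_entities:
                full_entities["PERSON"] = []
            full_entities["PERSON"].append(token + " " + proper_nouns[i+1])

print(full_entities)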

Upvotes: 0

T. Jeanneau

Reputation: 61

spaCy assigns a part-of-speech tag to every token (proper noun, determiner, verb, etc.). You can access it at the token level with token.pos_.

In your case:

import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("The company was also one of OpTic Gaming's main sponsors during the legendary organization's run to their first Call of Duty Championship back in 2017")

for tok in doc:
    print(tok, tok.pos_)

...
one NUM
of ADP
OpTic PROPN
Gaming PROPN
...

You can then filter on proper nouns, group the consecutive ones, and slice the doc to get the corresponding spans:

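def extract_proper_nouns(doc):
    # Indices of every token tagged as a proper noun, in document order
    pos = [tok.i for tok in doc if tok.pos_ == "PROPN"]
    consecutives = []  # finished runs of adjacent indices
    current = []       # run currently being built
    for elt in pos:
        if len(current) == 0:
            current.append(elt)
        else:
            if current[-1] == elt - 1:
                # Directly follows the previous proper noun: extend the run
                current.append(elt)
            else:
                # Gap in the indices: close the run and start a new one
                consecutives.append(current)
                current = [elt]
    if len(current) != 0:
        consecutives.append(current)
    # Slice the doc so each run becomes a single Span
    return [doc[consecutive[0]:consecutive[-1]+1] for consecutive in consecutives]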

extract_proper_nouns(doc)

[OpTic Gaming, Duty Championship]
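The grouping step can also be sketched on its own with itertools.groupby, independent of spaCy. This is only an illustration under stated assumptions: group_proper_nouns is a hypothetical helper name, it takes plain (text, pos) pairs rather than a Doc, and the sample tags are hand-written to mirror the output shown above (the tags after "Gaming" are assumed, not taken from a model run).

```python
from itertools import groupby


def group_proper_nouns(tagged_tokens):
    """Join runs of consecutive PROPN tokens into phrases.

    tagged_tokens: iterable of (text, pos) pairs in document order
    (an input shape assumed for this sketch, not a spaCy type).
    """
    phrases = []
    # groupby yields one group per run of adjacent items sharing the same key,
    # so every group whose key is True is a maximal run of proper nouns.
    for is_propn, run in groupby(tagged_tokens, key=lambda pair: pair[1] == "PROPN"):
        if is_propn:
            phrases.append(" ".join(text for text, _ in run))
    return phrases


# Hand-written sample; with spaCy the list could be built as
# [(tok.text, tok.pos_) for tok in doc].
sample = [
    ("one", "NUM"), ("of", "ADP"), ("OpTic", "PROPN"), ("Gaming", "PROPN"),
    ("'s", "PART"), ("main", "ADJ"), ("sponsors", "NOUN"),
]
print(group_proper_nouns(sample))  # ['OpTic Gaming']
```

Unlike slicing the doc, this version returns plain strings joined with a single space, so it loses the token offsets that a Span keeps. If you need those, the index-based function above is the better fit.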

More details here: https://spacy.io/usage/linguistic-features

Upvotes: 6
