Reputation: 19670
I'm using spaCy to build a keyword extractor. The keyword I'm looking for is OpTic Gaming
in the following text:
"The company was also one of OpTic Gaming's main sponsors during the legendary organization's run to their first Call of Duty Championship back in 2017"
How can I extract OpTic Gaming
from this text? If I use noun_chunks, I get OpTic Gaming's main sponsors (with root sponsors),
and if I iterate over the tokens, I get ["OpTic", "Gaming", "'s"].
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("The company was also one of OpTic Gaming's main sponsors during the legendary organization's run to their first Call of Duty Championship back in 2017")
for chunk in doc.noun_chunks:
    print(chunk.text, chunk.root.text, chunk.root.dep_,
          chunk.root.head.text)
The company company nsubj was
OpTic Gaming's main sponsors sponsors pobj of
their first Call Call pobj to
Duty Championship Championship pobj of
Upvotes: 3
Views: 6129
Reputation: 1
import spacy
nlp = spacy.load("en_core_web_sm")
text = "New Delhi is a Capital of India"
doc = nlp(text)
full_entities = {}
# Collect multi-word named entities, grouped by label
for ent in doc.ents:
    if ent.label_ in ["PERSON", "ORG", "GPE"] and " " in ent.text:
        if ent.label_ not in full_entities:
            full_entities[ent.label_] = []
        full_entities[ent.label_].append(ent.text)
# Fallback: if no entity was found, pair up adjacent proper nouns
if not full_entities:
    proper_nouns = [token.text for token in doc if token.pos_ == "PROPN"]
    for i, token in enumerate(proper_nouns[:-1]):
        if proper_nouns[i+1].istitle() and not token.endswith("."):
            if "PERSON" not in full_entities:
                full_entities["PERSON"] = []
            full_entities["PERSON"].append(token + " " + proper_nouns[i+1])
print(full_entities)
Upvotes: 0
Reputation: 61
spaCy assigns a part-of-speech tag to every token (proper noun, determiner, verb, etc.). You can access it at the token level with token.pos_
In your case:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("The company was also one of OpTic Gaming's main sponsors during the legendary organization's run to their first Call of Duty Championship back in 2017")
for tok in doc:
    print(tok, tok.pos_)
...
one NUM
of ADP
OpTic PROPN
Gaming PROPN
...
You can then filter on proper nouns, group the consecutive ones, and slice the doc to get the nominal groups:
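def extract_proper_nouns(doc):
    # Indices of all tokens tagged as proper nouns
    pos = [tok.i for tok in doc if tok.pos_ == "PROPN"]
    consecutives = []
    current = []
    for elt in pos:
        if len(current) == 0:
            current.append(elt)
        else:
            if current[-1] == elt - 1:
                # Adjacent to the previous proper noun: extend the current run
                current.append(elt)
            else:
                # Gap found: close the current run and start a new one
                consecutives.append(current)
                current = [elt]
    if len(current) != 0:
        consecutives.append(current)
    # Slice the doc so that each run becomes a Span
    return [doc[consecutive[0]:consecutive[-1]+1] for consecutive in consecutives]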
extract_proper_nouns(doc)
[OpTic Gaming, Duty Championship]
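The grouping step itself does not depend on spaCy, so it can be tried in isolation. Below is a minimal sketch that does the same thing with itertools.groupby on plain (text, pos) pairs. The function name group_consecutive_proper_nouns and the hand-tagged sample are assumptions made for illustration, not part of spaCy; with a real doc you would build the pairs from tok.text and tok.pos_.

```python
from itertools import groupby


def group_consecutive_proper_nouns(tagged_tokens):
    """Join runs of adjacent PROPN tokens into phrases.

    tagged_tokens: list of (text, pos) pairs, a hypothetical stand-in
    for spaCy tokens (tok.text, tok.pos_).
    """
    phrases = []
    # groupby splits the sequence into runs that share the same key value,
    # here "is this token a proper noun?"
    for is_propn, run in groupby(tagged_tokens, key=lambda pair: pair[1] == "PROPN"):
        if is_propn:
            phrases.append(" ".join(text for text, _ in run))
    return phrases


# Hand-tagged sample (assumed tags, mirroring the token output shown above)
sample = [
    ("one", "NUM"), ("of", "ADP"),
    ("OpTic", "PROPN"), ("Gaming", "PROPN"),
    ("'s", "PART"), ("main", "ADJ"), ("sponsors", "NOUN"),
]
print(group_consecutive_proper_nouns(sample))  # ['OpTic Gaming']
```

Note that this variant returns plain strings joined with a single space, whereas slicing the doc as above returns Span objects that keep the original whitespace and token attributes.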
More details here: https://spacy.io/usage/linguistic-features
Upvotes: 6