Reputation: 678
I have recently become acquainted with spaCy
and am quite interested in this Python library. However, for my use case, I intend to extract compound noun-adjective pairs as key phrases from an input sentence. I think spaCy
provides a lot of utilities for NLP tasks, but I didn't find a satisfying approach for my desired task. I looked into a very similar post on SO
, and a related post, but the solution is not very efficient and doesn't work for custom input sentences.
Here are some of the input sentences:
sentence_1="My problem was with DELL Customer Service"
sentence_2="Obviously one of the most important features of any computer is the human interface."
sentence_3="The battery life seems to be very good and have had no issues with it."
Here is the code that I tried:
import spacy

nlp = spacy.load("en_core_web_sm")

def get_compound_nn_adj(doc):
    compounds_nn_pairs = []
    parsed = nlp(doc)
    compounds = [token for token in parsed if token.dep_ == 'compound']
    # keep only the first token of each compound chain
    compounds = [nc for nc in compounds if nc.i == 0 or parsed[nc.i - 1].dep_ != 'compound']
    for token in compounds:
        pair_1, pair_2 = (False, False)
        # the span from the compound token to its head is the noun phrase
        noun = parsed[token.i:token.head.i + 1]
        pair_1 = noun
        if noun.root.dep_ == 'nsubj':
            adj_list = [rt for rt in noun.root.head.rights if rt.pos_ == 'ADJ']
            if adj_list:
                pair_2 = adj_list[0]
        if noun.root.dep_ == 'dobj':
            verb_root = [vb for vb in noun.root.ancestors if vb.pos_ == 'VERB']
            if verb_root:
                pair_2 = verb_root[0]
        if pair_1 and pair_2:
            compounds_nn_pairs.append((pair_1, pair_2))
    return compounds_nn_pairs
I suspect some modification needs to be applied to the above helper function, because it didn't return the expected compound noun-adjective pairs. Does anyone have good experience with spaCy
? How can I improve the above sketch of a solution? Any better ideas?
Desired output:
I am expecting to get compound noun-adjective pairs from each input sentence as follows:
desired_output_1="DELL Customer Service"
desired_output_2="human interface"
desired_output_3="battery life"
Is there any way I could get the expected output? What kind of update would be needed for the above implementation? Any more thoughts? Thanks in advance!
Upvotes: 0
Views: 2814
Reputation: 11
Extending the answers above, I would like to add that you can also get the context along with the word itself by checking its children, first on the left and then on the right.
doc = nlp('this is your sentence here')
for w in doc:
    if w.pos_ == "NOUN":
        # adjective/noun children to the left, the noun itself, then to the right
        context = [j.text for j in w.lefts if j.pos_ in ["ADJ", "NOUN"]]
        context.append(w.text)
        context.extend([j.text for j in w.rights if j.pos_ in ["ADJ", "NOUN"]])
You can also check the whole subtree with the token.subtree attribute, but in my case it performed worse and returned nearly the whole sentence.
Upvotes: 1
Reputation: 161
I suspect that this has to be handled with a database of compound nouns. The status of "compound noun" comes from commonality of usage. So, maybe the various n-gram databases (like Google's) could be a source.
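A minimal sketch of that idea, using a hand-made stand-in set instead of a real n-gram database (the entries and the helper name are invented for illustration):

```python
# Stand-in for a real compound-noun database (e.g. derived from n-gram counts)
known_compounds = {"customer service", "human interface", "battery life"}

def find_known_compounds(sentence, database):
    """Return every adjacent word pair from the sentence found in the database."""
    words = sentence.lower().split()
    return [f"{a} {b}" for a, b in zip(words, words[1:]) if f"{a} {b}" in database]

print(find_known_compounds("The battery life seems to be very good", known_compounds))
# → ['battery life']
```

A real system would also need to normalize punctuation and apply frequency thresholds when building the database.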
Upvotes: 1
Reputation: 610
It looks like spaCy is only detecting compound relations in sentences 1 and 3, and treating sentence 2's as an amod relation. (Here's some quick code to check its parse: [(i, i.pos_, i.dep_) for i in nlp(sentence_1)].)
To get the compounds out of 1 and 3, try this:
for i in nlp(sentence_1):
    if i.pos_ in ["NOUN", "PROPN"]:
        comps = [j for j in i.children if j.dep_ == "compound"]
        if comps:
            print(comps, i)
For each noun or proper noun in the sentence, it checks the token's children for compound
relations.
To cast a wider net that also picks up adjectives, you could look for adjectives and nouns in the word's subtree, not just compounds:
for i in nlp(sentence_2):
    if i.pos_ in ["NOUN", "PROPN"]:
        comps = [j for j in i.children if j.pos_ in ["ADJ", "NOUN", "PROPN"]]
        if comps:
            print(comps, i)
Upvotes: 2