Reputation: 1382
In NLTK, how do I traverse a parsed sentence to return a list of noun phrase strings?
I have two goals:
(1) Create the list of Noun Phrases instead of printing them using the 'traverse()' method. I presently use StringIO to record the output of the existing traverse() method. That is not an acceptable solution.
(2) De-parse the Noun Phrase string so: '(NP Michael/NNP Jackson/NNP)' becomes 'Michael Jackson'. Is there a method in NLTK to de-parse?
The NLTK documentation recommends using traverse() to view the Noun Phrase, but how do I capture the 't' in this recursive method so I generate a list of string Noun Phrases?
import nltk
from nltk.tag import pos_tag

def traverse(t):
    try:
        t.label()
    except AttributeError:
        return
    else:
        if t.label() == 'NP': print(t)  # or do something else
        else:
            for child in t:
                traverse(child)

def nounPhrase(tagged_sent):
    # Tag sentence for part of speech
    tagged_sent = pos_tag(sentence.split())  # List of tuples with [(Word, PartOfSpeech)]
    # Define several tag patterns
    grammar = r"""
      NP: {<DT|PP\$>?<JJ>*<NN>}   # chunk determiner/possessive, adjectives and noun
          {<NNP>+}                # chunk sequences of proper nouns
          {<NN>+}                 # chunk consecutive nouns
    """
    cp = nltk.RegexpParser(grammar)  # Define Parser
    SentenceTree = cp.parse(tagged_sent)
    NounPhrases = traverse(SentenceTree)  # collect Noun Phrase
    return(NounPhrases)

sentence = "Michael Jackson likes to eat at McDonalds"
tagged_sent = pos_tag(sentence.split())
NP = nounPhrase(tagged_sent)
print(NP)
This presently prints:
(NP Michael/NNP Jackson/NNP)
(NP McDonalds/NNP)
and stores 'None' to NP
Upvotes: 8
Views: 3073
Reputation: 149
An alternative way to extract noun phrases is to use the Constituent-Treelib library, which can be installed with: pip install constituent-treelib
Using this library, we need to perform the following steps to extract the (noun) phrases:
from constituent_treelib import ConstituentTree, BracketedTree
# First, we define the parsed sentence from where we want to extract phrases
parsed_sentence = "(S (NP (NNP Michael) (NNP Jackson)) (VP (VBZ likes) (S (VP (TO to) (VP (VB eat) (PP (IN at) (NP (NNPS McDonalds))))))))"
# ...and wrap the parsed sentence into a BracketedTree object
parsed_sentence = BracketedTree(parsed_sentence)
# Next, we define the language that should be considered with respect to the underlying models
language = ConstituentTree.Language.English
# You can also specify the desired model for the language ("Small" is selected by default)
spacy_model_size = ConstituentTree.SpacyModelSize.Large
# Now, we create the necessary NLP pipeline, which is required to create a ConstituentTree object
nlp = ConstituentTree.create_pipeline(language, spacy_model_size)
# If you wish, you can instruct the library to download and install the models automatically
# nlp = ConstituentTree.create_pipeline(language, spacy_model_size, download_models=True)
# Now we can instantiate a ConstituentTree object and pass it the parsed sentence as well as the NLP pipeline
tree = ConstituentTree(parsed_sentence, nlp)
# Finally, we can extract all phrases from the tree
all_phrases = tree.extract_all_phrases(avoid_nested_phrases=True)
>>> {'S': ['Michael Jackson likes to eat at McDonalds'],
>>> 'NP': ['Michael Jackson'],
>>> 'VP': ['likes to eat at McDonalds'],
>>> 'PP': ['at McDonalds']}
# ...or restrict them only to noun phrases
noun_phrases = all_phrases['NP']
>>> ['Michael Jackson']
In case you also want to visualize the tree, you can do it as follows:
tree.export_tree('my_tree.pdf')
Upvotes: 0
Reputation: 122142
def extract_np(psent):
    for subtree in psent.subtrees():
        if subtree.label() == 'NP':
            yield ' '.join(word for word, tag in subtree.leaves())

cp = nltk.RegexpParser(grammar)
parsed_sent = cp.parse(tagged_sent)
for npstr in extract_np(parsed_sent):
    print(npstr)
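For completeness, here is a self-contained sketch of the same idea that returns a list rather than printing. It shows two variants: one built on subtrees(), and one that keeps the recursive shape of the traverse() function from the question. The helper names (np_text, noun_phrases_flat, noun_phrases_recursive) are made up for illustration, and the sentence is tagged by hand here (an assumption) so that the example runs without downloading a tagger model; in practice you would pass in the output of pos_tag().

```python
from nltk import RegexpParser, Tree

# Grammar taken from the question.
GRAMMAR = r"""
  NP: {<DT|PP\$>?<JJ>*<NN>}   # determiner/possessive, adjectives and noun
      {<NNP>+}                # sequences of proper nouns
      {<NN>+}                 # consecutive nouns
"""

# Hand-tagged tokens (assumption: stands in for pos_tag(sentence.split())).
TAGGED_SENT = [
    ("Michael", "NNP"), ("Jackson", "NNP"), ("likes", "VBZ"), ("to", "TO"),
    ("eat", "VB"), ("at", "IN"), ("McDonalds", "NNP"),
]


def np_text(subtree):
    """Join the words of a chunk, dropping the POS tags ("de-parsing")."""
    return " ".join(word for word, _tag in subtree.leaves())


def noun_phrases_flat(tree):
    """Collect NP strings using Tree.subtrees() with a label filter."""
    return [np_text(st) for st in tree.subtrees(filter=lambda t: t.label() == "NP")]


def noun_phrases_recursive(node):
    """Same result, but written like the question's traverse()."""
    if not isinstance(node, Tree):
        return []  # a leaf, i.e. a (word, tag) tuple
    if node.label() == "NP":
        return [np_text(node)]
    found = []
    for child in node:
        found.extend(noun_phrases_recursive(child))
    return found


parsed = RegexpParser(GRAMMAR).parse(TAGGED_SENT)
noun_phrases = noun_phrases_flat(parsed)
print(noun_phrases)
```

Both functions give the same list for this flat chunk tree. With nested NPs they would differ: subtrees() also visits NPs inside other NPs, while the recursive version stops at the outermost NP, just as the original traverse() does.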
Upvotes: 8