Reputation: 1705
I am new to spaCy and I would like to extract "all" the noun phrases from a sentence. I'm wondering how I can do it. I have the following code:
import spacy

nlp = spacy.load("en_core_web_sm")
file = open("E:/test.txt", "r")
doc = nlp(file.read())

for np in doc.noun_chunks:
    print(np.text)
But it returns only the base noun phrases, that is, phrases which don't have any other NP nested in them. For example, for the following sentence, I get the result below:
Phrase: We try to explicitly describe the geometry of the edges of the images.
Result: We, the geometry, the edges, the images.
Expected result: We, the geometry, the edges, the images, the geometry of the edges of the images, the edges of the images.
How can I get all the noun phrases, including nested phrases?
Upvotes: 12
Views: 14794
Reputation: 149
I have created a lightweight library called constituent-treelib for this purpose. It builds on benepar, spaCy and NLTK and provides a simple way to access all the constituents (aka phrases) of a given sentence. The following steps will guide you to this goal:
First, install the library via:
pip install constituent-treelib
Then, load the respective components from the library and build the tree...
from constituent_treelib import ConstituentTree, Language
# Define the language for the sentence as well as for the spaCy and benepar models
language = Language.English
# Define which specific spaCy model should be used (default is Medium)
spacy_model_size = ConstituentTree.SpacyModelSize.Medium
# Create the pipeline (note, the required models will be downloaded and installed automatically)
nlp = ConstituentTree.create_pipeline(language, spacy_model_size)
# Your sentence
sentence = 'We try to explicitly describe the geometry of the edges of the images.'
# Create the tree from where we are going to extract the desired noun phrases
tree = ConstituentTree(sentence, nlp)
This creates the constituent tree of the sentence, which can, for example, be exported to a PDF via:
tree.export_tree("tree.pdf")
Now, to extract the noun phrases, we first extract all phrases via:
all_phrases = tree.extract_all_phrases(min_words_in_phrases=1)
print(all_phrases)
>> {'PP': ['of the edges of the images', 'of the images'], 'NP': ['We', 'the geometry of the edges of the images', 'the geometry', 'the edges of the images', 'the edges', 'the images'], 'S': ['We try to explicitly describe the geometry of the edges of the images .', 'to explicitly describe the geometry of the edges of the images'], 'VP': ['try to explicitly describe the geometry of the edges of the images', 'to explicitly describe the geometry of the edges of the images', 'describe the geometry of the edges of the images'], 'ADVP': ['explicitly']}
However, we are only interested in noun phrases (NP), including nested NPs.
print(all_phrases['NP'])
This returns your expected result:
>> ['We', 'the geometry of the edges of the images', 'the geometry', 'the edges of the images', 'the edges', 'the images']
Upvotes: 0
Reputation: 29
from spacy.matcher import Matcher
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp('Features of the iphone applications include a beautiful design, smart search, automatic labels and optional voice responses.')  ## sample text

matcher = Matcher(nlp.vocab)
pattern = [{"POS": "NOUN", "OP": "+"}]  ## one or more consecutive nouns
matcher.add("NOUN_PATTERN", [pattern])
print(matcher(doc, as_spans=True))
This gets all the nouns of your text. Using the Matcher with patterns is a great way to get the combinations you want. Change "en_core_web_sm" to "en_core_web_lg" if you want a larger, more accurate model.
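For example, here is a minimal sketch of a pattern that also captures an optional determiner and any adjectives in front of the nouns (the NP_PATTERN name and the pattern itself are just illustrative, not the only way to write it):

from spacy.matcher import Matcher
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp('Features of the iphone applications include a beautiful design.')

matcher = Matcher(nlp.vocab)
# Optional determiner, zero or more adjectives, one or more nouns
pattern = [{"POS": "DET", "OP": "?"}, {"POS": "ADJ", "OP": "*"}, {"POS": "NOUN", "OP": "+"}]
matcher.add("NP_PATTERN", [pattern])
print(matcher(doc, as_spans=True))  # includes spans such as "a beautiful design"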
Upvotes: 2
Reputation: 27337
Please try this to get the noun chunks from a text:
import spacy
nlp = spacy.load("en_core_web_sm")
text = ("We try to explicitly describe the geometry of the edges of the images.")
doc = nlp(text)
print([chunk.text for chunk in doc.noun_chunks])
Upvotes: 1
Reputation: 8193
For every noun chunk you can also get the subtree beneath it. spaCy provides two ways to access it: the left_edge and right_edge attributes, and the subtree attribute, which returns a Token iterator rather than a span. Combining noun_chunks and their subtrees leads to some duplication, which can be removed later.
Here is an example using the left_edge and right_edge attributes:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("We try to explicitly describe the geometry of the edges of the images.")

# For each noun chunk, keep both the chunk itself and the span from its
# root's left edge to its right edge; the set removes duplicates.
{np.text
 for nc in doc.noun_chunks
 for np in [
     nc,
     doc[nc.root.left_edge.i : nc.root.right_edge.i + 1]]}
==>
{'We',
'the edges',
'the edges of the images',
'the geometry',
'the geometry of the edges of the images',
'the images'}
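And here is a similar sketch using the subtree attribute instead. Since subtree yields Token objects rather than a span, the sketch rebuilds a span from the first and last token of each subtree (it assumes the same nlp and doc as above):

results = set()
for nc in doc.noun_chunks:
    results.add(nc.text)
    subtree = list(nc.root.subtree)  # Token iterator, in document order
    # Rebuild a contiguous span from the subtree's first and last token
    results.add(doc[subtree[0].i : subtree[-1].i + 1].text)
print(results)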
Upvotes: 3
Reputation: 1882
Please see the commented code below, which recursively combines the nouns. The code is inspired by the spaCy docs.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("We try to explicitly describe the geometry of the edges of the images.")

for np in doc.noun_chunks:  # use np instead of np.text
    print(np)
print()

# code to recursively combine nouns
# 'We' is actually a pronoun but included in your question,
# hence the token.pos_ == "PRON" part in the last if statement;
# suggest you extract PRON separately like the noun chunks above
nounIndices = []
for index, token in enumerate(doc):
    # print(token.text, token.pos_, token.dep_, token.head.text)
    if token.pos_ == 'NOUN':
        nounIndices.append(index)
print(nounIndices)

for idxValue in nounIndices:
    # re-parse so earlier merges don't shift this iteration's indices
    doc = nlp("We try to explicitly describe the geometry of the edges of the images.")
    span = doc[doc[idxValue].left_edge.i : doc[idxValue].right_edge.i + 1]
    # span.merge() was removed in spaCy 3.x; use the retokenizer instead
    with doc.retokenize() as retokenizer:
        retokenizer.merge(span)
    for token in doc:
        if token.dep_ in ('dobj', 'pobj') or token.pos_ == "PRON":
            print(token.text)
Upvotes: 12