Reputation: 7261
I am using spacy to parse documents and unfortunately I am unable to process noun chunks the way I would have expected them to be processed. Below is my code:
# Import spacy
import spacy
nlp = spacy.load("en_core_web_lg")
# Add noun chunking to the pipeline
merge_noun_chunks = nlp.create_pipe("merge_noun_chunks")
nlp.add_pipe(merge_noun_chunks)
# Process the document
docs = nlp.pipe(["The big dogs chased the fast cat"])
# Print out the tokens
for doc in docs:
    for token in doc:
        print("text: {}, lemma: {}, pos: {}, tag: {}, dep: {}".format(token.text, token.lemma_, token.pos_, token.tag_, token.dep_))
The output I get is as follows:
text: The big dogs, lemma: the, pos: NOUN, tag: NNS, dep: nsubj
text: chased, lemma: chase, pos: VERB, tag: VBD, dep: ROOT
text: the fast cat, lemma: the, pos: NOUN, tag: NN, dep: dobj
The issue is in the first line of output, where "The big dogs" was parsed in an unexpected fashion: it produced a "lemma" of "the" and indicated a "pos" of "NOUN", a "tag" of "NNS", and a "dep" of "nsubj".
The output I was hoping to get is as follows:
text: The big dogs, lemma: the big dog, pos: NOUN, tag: NNS, dep: nsubj
text: chased, lemma: chase, pos: VERB, tag: VBD, dep: ROOT
text: the fast cat, lemma: the fast cat, pos: NOUN, tag: NN, dep: dobj
I expected the "lemma" to be the phrase "the big dog", with the plural form changed to singular, and the phrase to have a "pos" of "NOUN", a "tag" of "NNS", and a "dep" of "nsubj".
Is this the correct behaviour, or am I using spacy incorrectly? If I am using spacy incorrectly, please let me know the correct manner in which to perform this task.
Upvotes: 1
Views: 255
Reputation: 3154
There are a few things to consider here.
You will probably get "the big dog" if you take the lemma_ attribute of each token individually. Using that attribute does not update the token's pos.
Also, since dependency parsing and POS tagging are performed by a trained predictive model, the results are not guaranteed to always be "right" from a human linguistic perspective.
Other than the lemma issue, it seems you are using spaCy correctly.
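As a rough sketch of the per-token lemma idea above: one option is to leave merge_noun_chunks out of the pipeline, iterate over doc.noun_chunks, and join the lemma_ of each token in the chunk. The helper names phrase_lemma, noun_chunk_lemmas, and demo are hypothetical, and the sketch assumes the en_core_web_lg model from the question is installed. The exact lemmas depend on the model and spaCy version, so no output is shown.

```python
def phrase_lemma(tokens):
    """Join the lemma_ of each token in a span-like iterable into one string."""
    return " ".join(token.lemma_ for token in tokens)


def noun_chunk_lemmas(doc):
    """Return (chunk text, joined lemma) pairs for every noun chunk in doc."""
    return [(chunk.text, phrase_lemma(chunk)) for chunk in doc.noun_chunks]


def demo():
    # Imported here so the helpers above stay usable without spaCy installed.
    import spacy

    # Assumption: same model as in the question. merge_noun_chunks is
    # deliberately NOT added, so every word keeps its own token and lemma.
    nlp = spacy.load("en_core_web_lg")
    doc = nlp("The big dogs chased the fast cat")
    for text, lemma in noun_chunk_lemmas(doc):
        print("text: {}, lemma: {}".format(text, lemma))
```

Calling demo() prints one line per noun chunk. The pos, tag, and dep of a chunk can still be read from its head word via chunk.root if they are needed alongside the joined lemma.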
Upvotes: 1