spacy noun-chunking creates unexpected lemma, pos, tag, and dep

Question

I am using spacy to parse documents and unfortunately I am unable to process noun chunks the way I would have expected them to be processed. Below is my code:

# Import spacy
import spacy
nlp = spacy.load("en_core_web_lg")

# Add noun chunking to the pipeline
merge_noun_chunks = nlp.create_pipe("merge_noun_chunks")
nlp.add_pipe(merge_noun_chunks)

# Process the document
docs = nlp.pipe(["The big dogs chased the fast cat"])

# Print out the tokens
for doc in docs:
    for token in doc:
        print("text: {}, lemma: {}, pos: {}, tag: {}, dep: {}".format(token.text, token.lemma_, token.pos_, token.tag_, token.dep_))

The output I get is as follows:

text: The big dogs, lemma: the, pos: NOUN, tag: NNS, dep: nsubj
text: chased, lemma: chase, pos: VERB, tag: VBD, dep: ROOT
text: the fast cat, lemma: the, pos: NOUN, tag: NN, dep: dobj

The issue is in the first line of output, where "The big dogs" was parsed in an unexpected fashion: it was given a "lemma" of "the", a "pos" of "NOUN", a "tag" of "NNS", and a "dep" of "nsubj".

The output I was hoping to get is as follows:

text: The big dogs, lemma: the big dog, pos: NOUN, tag: NNS, dep: nsubj
text: chased, lemma: chase, pos: VERB, tag: VBD, dep: ROOT
text: the fast cat, lemma: the fast cat, pos: NOUN, tag: NN, dep: dobj

I expected the "lemma" to be the phrase "the big dog", with the plural form changed to singular, and the phrase to have a "pos" of "NOUN", a "tag" of "NNS", and a "dep" of "nsubj".

Is this the correct behaviour, or am I using spacy incorrectly? If I am using spacy incorrectly, please let me know the correct manner in which to perform this task.

nmlq · Accepted Answer

There are a few things to consider here:

Lemmatisation is token based.
POS tagging and dependency parsing are predictive.

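You will probably get "the big dog" if you take the lemma_ attribute of each individual token in the chunk (that is, before the chunk is merged into a single token) and join the results. Reading the attribute does not update the token's pos.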
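As a rough sketch of that idea, the helpers below join the per-token lemmas of each noun chunk and report the pos, tag, and dep of the chunk's root token. The names joined_lemma, noun_chunk_rows, and demo are made up for this illustration, and the sketch assumes a parsed doc from a pipeline that does not have merge_noun_chunks added, so each chunk still consists of separate tokens.

```python
def joined_lemma(tokens):
    """Build a phrase-level lemma by joining each token's own lemma_."""
    return " ".join(token.lemma_ for token in tokens)


def noun_chunk_rows(doc):
    """Collect (text, joined lemma, root pos, root tag, root dep) per noun chunk.

    Assumes the noun chunks have NOT been merged yet, so every chunk still
    consists of separate tokens that each carry their own lemma.
    """
    rows = []
    for chunk in doc.noun_chunks:
        root = chunk.root  # syntactic head of the chunk
        rows.append((chunk.text, joined_lemma(chunk), root.pos_, root.tag_, root.dep_))
    return rows


def demo():
    # Imported here so the helpers above can be used without loading a model.
    import spacy

    nlp = spacy.load("en_core_web_lg")  # same model as in the question
    doc = nlp("The big dogs chased the fast cat")
    for row in noun_chunk_rows(doc):
        print(row)
```

Calling demo() prints one row per noun chunk. The exact lemmas and labels depend on the model and spaCy version installed.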

Also, since dependency parsing and POS tagging are done by trained predictive models, they are not guaranteed to always be "right" from a human linguistic perspective.

Other than the lemma issue, it seems you are using spacy correctly.
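If the merged tokens themselves should carry a phrase-level lemma, one possible approach is to do the merge by hand with doc.retokenize() and pass the lemma explicitly. The sketch below is an assumption about how that could look, not the built-in merge_noun_chunks component; the function name merge_noun_chunks_with_lemma is hypothetical. It copies the tag and dep from the chunk's root token and sets LEMMA to the joined per-token lemmas.

```python
def merge_noun_chunks_with_lemma(doc):
    """Merge each noun chunk into one token and set its lemma explicitly.

    Assumes `doc` is a parsed spaCy Doc whose noun chunks are not merged yet.
    """
    with doc.retokenize() as retokenizer:
        for chunk in doc.noun_chunks:
            attrs = {
                # Join the lemmas of the individual tokens, e.g. "the big dog".
                "LEMMA": " ".join(token.lemma_ for token in chunk),
                # Carry over the tag and dependency label of the chunk's root.
                "TAG": chunk.root.tag,
                "DEP": chunk.root.dep,
            }
            retokenizer.merge(chunk, attrs=attrs)
    return doc
```

It would be called on a doc from a pipeline without the built-in merge_noun_chunks component, for example doc = merge_noun_chunks_with_lemma(nlp("The big dogs chased the fast cat")). How to register such a function as a pipeline component differs between spaCy versions, so that step is left out here.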
