Student

Reputation: 31

Lemmatization of words using spacy and nltk not giving correct lemma

I want to get the lemmatized words of the words in list given below:

(eg)

words = ['Funnier','Funniest','mightiest','tighter']

When I use spaCy:

import spacy

# 'en' is the model shortcut used by older spaCy releases;
# newer versions load a named model such as 'en_core_web_sm'
nlp = spacy.load('en')
words = ['Funnier','Funniest','mightiest','tighter','biggify']
doc = spacy.tokens.Doc(nlp.vocab, words=words)
for items in doc:
    print(items.lemma_)

I got these lemmas:

Funnier
Funniest
mighty
tight 

When I use NLTK's WordNetLemmatizer:

from nltk.stem import WordNetLemmatizer 
lemmatizer = WordNetLemmatizer() 
words = ['Funnier','Funniest','mightiest','tighter','biggify']
for token in words:
    print(token + ' --> ' +  lemmatizer.lemmatize(token))

I got:

Funnier --> Funnier
Funniest --> Funniest
mightiest --> mightiest
tighter --> tighter
biggify --> biggify

Can anyone help with this?

Thanks.

Upvotes: 3

Views: 5538

Answers (1)

abheet22

Reputation: 470

Lemmatisation depends entirely on the part-of-speech (POS) tag you supply when looking up the lemma of a particular word; WordNet's lemmatizer treats every word as a noun by default.

import nltk
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# Define the sentence to be lemmatized
sentence = "The striped bats are hanging on their feet for best"

# Tokenize: split the sentence into words
word_list = nltk.word_tokenize(sentence)
print(word_list)
#> ['The', 'striped', 'bats', 'are', 'hanging', 'on', 'their', 'feet', 'for', 'best']

# Lemmatize each word (default POS is noun) and join
lemmatized_output = ' '.join([lemmatizer.lemmatize(w) for w in word_list])
print(lemmatized_output)
#> The striped bat are hanging on their foot for best

The above is a simple example of using the WordNet lemmatizer on words and sentences.

Notice that it didn't do a good job: 'are' was not converted to 'be' and 'hanging' was not converted to 'hang' as expected. This can be corrected by passing the correct part-of-speech (POS) tag as the second argument to lemmatize().

Sometimes the same word can have multiple lemmas, depending on its meaning or context:

print(lemmatizer.lemmatize("stripes", 'v'))  
#> strip

print(lemmatizer.lemmatize("stripes", 'n'))  
#> stripe

For the example in the question, specify the corresponding POS tag (adjective):

from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
words = ['Funnier','Funniest','mightiest','tighter','biggify']
for token in words:
    # WordNet lookups are case-sensitive, so lowercase first
    print(token + ' --> ' + lemmatizer.lemmatize(token.lower(), wordnet.ADJ))

Upvotes: 3
