How can I stop stemming from removing letters that change the meaning of a word?

Question

I am trying to decide whether to use stemming, lemmatizing, or both for NLP preprocessing. Stemming strips letters that matter to the word's meaning, while lemmatizing leaves some words unreduced.

For example,

from nltk.stem import LancasterStemmer, WordNetLemmatizer

tokens = ['carry', 'xy', 'known', 'size', 'may', 'use', 'column', 'value', 'contracting']

stemmer = LancasterStemmer()
stems = []

for wrd in tokens:
    stems.append(stemmer.stem(wrd))
print(stems)

This produces the following stemming output:

['carry', 'xy', 'known', 'siz', 'may', 'us', 'column', 'valu', 'contract']

It correctly reduces contracting to contract, but it also truncates size, use, and value to siz, us, and valu.

lem = WordNetLemmatizer()
lems = []
for wrd in tokens:
    lems.append(lem.lemmatize(wrd))
print(lems)

This produces the following lemmatizing output:

['carry', 'xy', 'known', 'size', 'may', 'use', 'column', 'value', 'contracting']

Here contracting is not reduced to contract, but the other words are correctly left intact.
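One likely reason contracting is left alone is that WordNetLemmatizer.lemmatize treats every word as a noun unless a pos argument is passed. The sketch below tries each part of speech in turn and keeps the first lemma that differs from the token. The helper name lemmatize_with_pos_fallback and the default order ("v", "n") are assumptions made for illustration, not part of NLTK. Putting verbs first should also turn known into know, which may or may not be what you want.

```python
def lemmatize_with_pos_fallback(tokens, lemmatize, pos_order=("v", "n")):
    """Lemmatize each token, trying each part of speech in order.

    `lemmatize` is any callable taking (word, pos), for example the
    bound method WordNetLemmatizer().lemmatize. The first lemma that
    differs from the token wins; otherwise the token is kept as-is.
    """
    lemmas = []
    for token in tokens:
        lemma = token
        for pos in pos_order:
            candidate = lemmatize(token, pos)
            if candidate != token:
                lemma = candidate
                break
        lemmas.append(lemma)
    return lemmas


# Usage with NLTK (needs the WordNet corpus, e.g. nltk.download("wordnet")):
#   from nltk.stem import WordNetLemmatizer
#   lem = WordNetLemmatizer()
#   print(lemmatize_with_pos_fallback(tokens, lem.lemmatize))
```

A more precise alternative is to run a part-of-speech tagger first and pass each token's own tag, at the cost of an extra processing step.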

Which option is the better choice, and does the answer depend on whether the dataset is large or small?

How can I stop stemming from removing letters that change the meaning of a word?
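One possible way to limit over-stemming, offered as a sketch and not as a built-in NLTK feature, is to wrap the stemmer so that it skips a list of protected words and, optionally, rejects any stem that is not itself a known word. The names make_guarded_stemmer, protected, and vocabulary are hypothetical and introduced only for this example.

```python
def make_guarded_stemmer(stem, protected=(), vocabulary=None):
    """Wrap a stemming function so it cannot damage chosen words.

    stem       -- any callable mapping a word to its stem,
                  for example LancasterStemmer().stem
    protected  -- words that are always returned unchanged
    vocabulary -- optional set of acceptable stems; when given, a stem
                  outside this set is discarded and the word is kept
    """
    protected = frozenset(protected)

    def guarded(word):
        if word in protected:
            return word
        candidate = stem(word)
        if vocabulary is not None and candidate not in vocabulary:
            return word
        return candidate

    return guarded


# Usage with NLTK (the vocabulary here is an illustrative assumption):
#   from nltk.stem import LancasterStemmer
#   guarded = make_guarded_stemmer(LancasterStemmer().stem,
#                                  vocabulary=set(tokens) | {"contract"})
#   print([guarded(w) for w in tokens])
```

With a vocabulary check, a stem such as siz is rejected because it is not a known word, while contract is accepted. The quality of the result depends entirely on how complete the vocabulary is.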
