Vance Pyton

Reputation: 174

How to stop stemming from removing letters that change the meaning of the word?

I am trying to choose between stemming, lemmatizing, or both for NLP preprocessing. When I stem, it strips important suffixes. When I lemmatize, it fails to reduce some words.

For example,

from nltk.stem import LancasterStemmer, WordNetLemmatizer

tokens = ['carry', 'xy', 'known', 'size', 'may', 'use', 'column', 'value', 'contracting']

stemmer = LancasterStemmer()
stems = []

for wrd in tokens:
   stems.append(stemmer.stem(wrd))
print(stems)

This is producing the following stemming output:

['carry', 'xy', 'known', 'siz', 'may', 'us', 'column', 'valu', 'contract']

Here it rightly reduces contracting to contract, but it over-stems use, size, and value to us, siz, and valu.

lem = WordNetLemmatizer()
lems = []
for wrd in tokens:
   lems.append(lem.lemmatize(wrd))
print(lems)

This is producing the following lemmatizing output:

['carry', 'xy', 'known', 'size', 'may', 'use', 'column', 'value', 'contracting']

Here contracting is not reduced to contract, but the other words are rightly captured.

Which would be the best option to go with if the data is either large or small?

Upvotes: 1

Views: 503

Answers (1)

Wiktor Stribiżew

Reputation: 626754

You can use the spaCy lemmatizer:

import spacy
nlp = spacy.load('en_core_web_sm')

tokens = ['carry', 'xy', 'known', 'size', 'may', 'use', 'column', 'value', 'contracting']
result = [nlp(t)[0].lemma_ for t in tokens]
print(result)
# => ['carry', 'xy', 'know', 'size', 'may', 'use', 'column', 'value', 'contract']
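
A rough way to see why the three outputs differ: a stemmer applies suffix-stripping rules without checking a dictionary, while a lemmatizer looks the word up, and that lookup depends on the part of speech. WordNetLemmatizer.lemmatize treats every word as a noun unless you pass pos, which is probably why contracting was left alone in the question. The sketch below is only a toy to show the mechanism, not the algorithm NLTK or spaCy actually use; LEMMA_TABLE, SUFFIXES, toy_stem and toy_lemmatize are made-up names for illustration.

```python
# Toy illustration only: NOT the algorithm used by NLTK or spaCy.

# Hypothetical mini-dictionary mapping (word, part of speech) to its lemma.
LEMMA_TABLE = {
    ("contracting", "VERB"): "contract",
    ("known", "VERB"): "know",
}

# Hypothetical, deliberately crude suffix rules.
SUFFIXES = ("ing", "e")


def toy_stem(word):
    """Strip the first matching suffix, with no dictionary check."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word, pos="NOUN"):
    """Look the word up under the given part of speech; return it unchanged if absent."""
    return LEMMA_TABLE.get((word, pos), word)


tokens = ["size", "use", "value", "known", "contracting"]

stems = [toy_stem(t) for t in tokens]
lemmas_as_nouns = [toy_lemmatize(t) for t in tokens]
lemmas_as_verbs = [toy_lemmatize(t, pos="VERB") for t in tokens]

print(stems)            # rules chop 'size' to 'siz' because nothing checks a dictionary
print(lemmas_as_nouns)  # 'contracting' survives: no noun entry maps it to 'contract'
print(lemmas_as_verbs)  # with the verb reading, 'contracting' becomes 'contract'
```

With NLTK itself, the corresponding call is lem.lemmatize(wrd, pos='v'). spaCy assigns the part of speech on its own, and its tagger generally has more context to work with when it is given whole sentences rather than isolated tokens.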

Upvotes: 1
