I am trying to decide between stemming and lemmatizing (or both) for NLP preprocessing. When I stem, it strips away important suffixes; when I lemmatize, some words are not reduced at all.
For example,
from nltk.stem import LancasterStemmer, WordNetLemmatizer
tokens = ['carry', 'xy', 'known', 'size', 'may', 'use', 'column', 'value', 'contracting']
stemmer = LancasterStemmer()
stems = []
for wrd in tokens:
    stems.append(stemmer.stem(wrd))
print(stems)
This is producing the following stemming output:
['carry', 'xy', 'known', 'siz', 'may', 'us', 'column', 'valu', 'contract']
where it rightly reduces contracting to contract, but it clips size, use, and value down to siz, us, and valu instead of leaving them intact.
lem = WordNetLemmatizer()
lems = []
for wrd in tokens:
    lems.append(lem.lemmatize(wrd))
print(lems)
This is producing the following lemmatizing output:
['carry', 'xy', 'known', 'size', 'may', 'use', 'column', 'value', 'contracting']
where contracting is not reduced to contract, but the others are rightly captured.
Which is the better option to go with, whether the data is large or small?
You can use the spaCy lemmatizer:
import spacy
nlp = spacy.load('en_core_web_sm')
tokens = ['carry', 'xy', 'known', 'size', 'may', 'use', 'column', 'value', 'contracting']
# run each token through the pipeline and read the lemma of its single token
result = [nlp(t)[0].lemma_ for t in tokens]
print(result)
# => ['carry', 'xy', 'know', 'size', 'may', 'use', 'column', 'value', 'contract']
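For larger data, calling nlp() once per token is slow, since it runs the whole pipeline one string at a time. A sketch of a batched version, assuming spaCy v3 with the same en_core_web_sm model (where the rule-based lemmatizer depends on the tagger and attribute_ruler, so the parser and NER can be disabled safely):

import spacy

# load only what lemmatization needs; parser and NER are dead weight here
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

tokens = ['carry', 'xy', 'known', 'size', 'may', 'use', 'column', 'value', 'contracting']
# nlp.pipe batches the documents instead of invoking the pipeline per token
result = [doc[0].lemma_ for doc in nlp.pipe(tokens)]
print(result)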
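As an aside, the reason WordNetLemmatizer left contracting alone is that lemmatize() defaults to pos='n', so every token is treated as a noun. Passing pos='v' lets it reduce verb forms too; a minimal sketch with your token list:

from nltk.stem import WordNetLemmatizer

lem = WordNetLemmatizer()
tokens = ['carry', 'xy', 'known', 'size', 'may', 'use', 'column', 'value', 'contracting']
# pos='v' asks WordNet for the verb lemma; the default pos='n' leaves verb forms untouched
print([lem.lemmatize(wrd, pos='v') for wrd in tokens])
# expected: ['carry', 'xy', 'know', 'size', 'may', 'use', 'column', 'value', 'contract']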