Reputation: 2489
Given a list of words like this ['add', 'adds', 'adding', 'added', 'addition'], I want to stem all of them to the same word 'add'. That means stemming all different verb and noun forms of a word (but not its adjective and adverb forms) into one.
I couldn't find any stemmer that does that. The closest one I found is PorterStemmer, but it stems the above list to ['add', 'add', 'ad', 'ad', 'addit']
I'm not very experienced with stemming techniques, so I want to ask: is there any available stemmer that does what I explained above? If not, do you have any suggestions on how to achieve that?
Many thanks,
Upvotes: 2
Views: 2362
Reputation: 44009
Lemmatization should lead to better results than stemming (source):
Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes.
Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.
Lemmatization is supported in NLTK as part of the nltk.stem package:
import nltk  # WordNetLemmatizer needs the WordNet corpus: nltk.download('wordnet')
l = nltk.stem.WordNetLemmatizer()
l.lemmatize('dogs') # -> 'dog'
l.lemmatize('addition') # -> 'addition'
s = nltk.stem.snowball.EnglishStemmer()
s.stem('dogs') # -> 'dog'
s.stem('addition') # -> 'addit'
If the lemmatizer does not recognize the word, it will not change it. One pitfall is that by default all words are considered nouns. To override that behavior, you have to set the pos argument, which defaults to pos='n':
s.stem('better') # -> 'better'
l.lemmatize('better') # -> 'better'
l.lemmatize('better', pos='a') # -> 'good'
Upvotes: 2
Reputation: 15722
The idea of stemming is to reduce different forms of the same word to a single "base" form. That is not what you are asking for, so probably no existing stemmer fulfills your needs (at least not on purpose). So the obvious solution to your problem is: if you have your own custom rules, you have to implement them yourself.
You don't say much about your requirements. Depending on your needs, you may have to start from scratch. If the Porter stemmer comes close to your needs except in some special cases, you could hand-code some overrides and use an existing stemmer for the other cases.
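As a rough sketch of the override idea, something like the following could work. The names make_stemmer, strip_s and OVERRIDES are made up for illustration, and strip_s is only a toy stand-in for a real stemmer so that the snippet runs without any extra packages:

```python
from typing import Callable, Dict


def make_stemmer(base_stem: Callable[[str], str],
                 overrides: Dict[str, str]) -> Callable[[str], str]:
    """Return a stemmer that checks hand-coded overrides before base_stem."""
    def stem(word: str) -> str:
        key = word.lower()
        # Hand-coded special cases take priority over the generic stemmer.
        if key in overrides:
            return overrides[key]
        return base_stem(key)
    return stem


def strip_s(word: str) -> str:
    """Toy stand-in for a real stemmer: drops a single trailing 's'."""
    if len(word) > 3 and word.endswith('s') and not word.endswith('ss'):
        return word[:-1]
    return word


# Hypothetical override table; extend it with the cases that matter to you.
OVERRIDES = {'adding': 'add', 'added': 'add', 'addition': 'add'}

stem = make_stemmer(strip_s, OVERRIDES)
words = ['add', 'adds', 'adding', 'added', 'addition']
stems = [stem(w) for w in words]
print(stems)
```

In practice you would probably pass the stem method of an existing stemmer (for example NLTK's PorterStemmer) as base_stem instead of the toy function, and keep the override table only for the words it gets wrong for your purposes.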
Upvotes: 0