Reputation: 1557
I am new to NLTK, and I'm trying out its stemmers.
I have a simple example sentence to process: "Turn on the lightin". I want to see whether an NLTK stemmer can help me handle the typo "lightin". I tested the Snowball stemmer with "lighting" and it returns the expected stem "light", but for "lightin" it returns "lightin" unchanged.
My stemming process is very trivial:
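from nltk.stem import SnowballStemmer
snowBallStemmer = SnowballStemmer("english")
tokens = "Turn on the lightin".split()  # split into words; iterating over a str yields single characters
for token in tokens:
    print("Snowball stemmer: " + snowBallStemmer.stem(token))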
According to NLTK's documentation, the Snowball stemmer can be used to stem English. I want to know why it failed to stem "lightin" and what I could do to fix this.
Upvotes: 1
Views: 573
Reputation: 122250
Try running a spellchecker (e.g. pyenchant) before stemming:
>>> import enchant
>>> from nltk.stem import SnowballStemmer
>>> d = enchant.Dict("en_US")
>>> d.suggest('lightin')
['lighting', 'lighten', 'light in', 'light-in', 'lightning', 'lightering', 'sighting', 'light', 'flitting', 'Litton']
>>> snowball = SnowballStemmer('english')
>>> snowball.stem(d.suggest('lightin')[0])
u'light'
>>> sent = "Turn on the lightin".split()
>>> [snowball.stem(word if d.check(word) else d.suggest(word)[0]) for word in sent]
[u'turn', 'on', u'the', u'light']
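One caveat with the one-liner above: d.suggest(word) may return an empty list for a word the dictionary cannot match, and indexing [0] would then raise an IndexError. Below is a minimal sketch of the same check-then-suggest-then-stem idea with that guard added. The helper name stem_with_spellcheck is hypothetical (it is not part of NLTK or pyenchant), and it takes the three operations as plain callables, so it only assumes that check returns a bool, suggest returns a list of strings, and stem returns a string.

```python
def stem_with_spellcheck(words, check, suggest, stem):
    """Stem each word, first swapping unknown words for the top spelling suggestion.

    check, suggest and stem are plain callables (hypothetical parameter names),
    so this sketch does not depend on any particular library.
    """
    stems = []
    for word in words:
        if not check(word):
            candidates = suggest(word)
            # suggest() may return an empty list; in that case keep the original word.
            if candidates:
                word = candidates[0]
        stems.append(stem(word))
    return stems


# Assumed wiring with the objects from the session above:
#   d = enchant.Dict("en_US")
#   snowball = SnowballStemmer("english")
#   stem_with_spellcheck("Turn on the lightin".split(), d.check, d.suggest, snowball.stem)
```

Note that the top suggestion is not guaranteed to be the word you meant, and it can even be a multi-word phrase such as 'light in', so for anything beyond a quick experiment you may want to filter the candidates before stemming.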
Upvotes: 1