Neroksi

Reputation: 1398

Snowball Stemmer: poor French stemming

I'm working on some NLP tasks. My inputs are French text, so Snowball Stemmer is the only stemmer usable in my context. Unfortunately, it keeps giving me poor stems: it won't even remove a plural "s" or a silent "e". Here is an example:

from nltk.stem import SnowballStemmer
SnowballStemmer("french").stem("pommes, noisettes dorées & moelleuses, la boîte de 350g")
Output: 'pommes, noisettes dorées & moelleuses, la boîte de 350g'

Upvotes: 1

Views: 4015

Answers (1)

alvas

Reputation: 121992

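Stemmers stem words, not sentences. Given a whole sentence, the stemmer treats it as one long word and only considers the suffix at the very end, which is why your string comes back unchanged. Tokenize the sentence and stem the tokens individually.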

>>> from nltk import word_tokenize
>>> from nltk.stem import SnowballStemmer

>>> fr = SnowballStemmer('french')

>>> sent = "pommes, noisettes dorées & moelleuses, la boîte de 350g"
>>> word_tokenize(sent)
['pommes', ',', 'noisettes', 'dorées', '&', 'moelleuses', ',', 'la', 'boîte', 'de', '350g']

>>> [fr.stem(word) for word in word_tokenize(sent)]
['pomm', ',', 'noiset', 'dor', '&', 'moelleux', ',', 'la', 'boît', 'de', '350g']

>>> ' '.join([fr.stem(word) for word in word_tokenize(sent)])
'pomm , noiset dor & moelleux , la boît de 350g'
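
If you would rather keep the original punctuation and spacing instead of re-joining tokens with spaces, one option is to stem each word in place with a regex substitution. This is only a sketch: `stem_text` and `stem_french` are hypothetical helper names, and treating every `\w+` run as a "word" is my assumption (it is a cruder split than `word_tokenize`, e.g. it breaks at apostrophes and hyphens).

```python
import re


def stem_text(text, stem):
    """Apply a word-level `stem` callable to every word in `text`.

    A "word" here is any run of \\w characters (an assumption, not
    NLTK's tokenization). Punctuation and whitespace are left as-is.
    """
    return re.sub(r"\w+", lambda match: stem(match.group(0)), text)


def stem_french(text):
    """Stem each word of `text` with NLTK's French Snowball stemmer."""
    # Imported here so stem_text stays usable without NLTK installed.
    from nltk.stem import SnowballStemmer

    return stem_text(text, SnowballStemmer("french").stem)
```

Calling `stem_french("pommes, noisettes dorées & moelleuses, la boîte de 350g")` should then stem the same words as above while leaving the commas attached where they were.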

Upvotes: 6
