Reputation: 23
I am newly getting into NLP, Python, and posting on Stackoverflow at the same time, so please be patient with me if I might seem ignorant :).
I am using SnowballStemmer in Python's NLTK in order to stem words for textual analysis. While lemmatization seems to understem my tokens, the snowball porter2 stemmer, which I read is mostly preferred to the basic porter stemmer, overstems my tokens. I am analyzing tweets including many names and probably also places and other words which should not be stemmed, like: hillary, hannity, president, which are now reduced to hillari, hanniti, and presid (you probably guessed already whose tweets I am analyzing).
Is there an easy way to exclude certain terms from stemming? Conversely, I could also merely lemmatize tokens and include a rule for common suffixes like -ed, -s, …. Another idea might be to merely stem verbs and adjectives as well as nouns ending in s. That might also be close enough…
I am using below code as of now:
# LEMMATIZE AND STEM WORDS
from nltk.stem.snowball import EnglishStemmer
lemmatizer = nltk.stem.WordNetLemmatizer()
snowball = EnglishStemmer()
def lemmatize_text(text):
return [lemmatizer.lemmatize(w) for w in text]
def snowball_stemmer(text):
return [snowball.stem(w) for w in text]
# APPLY FUNCTIONS
tweets['text_snowball'] = tweets.text_processed.apply(snowball_stemmer)
tweets['text_lemma'] = tweets.text_processed.apply(lemmatize_text)
I hope someone can help… Contrary to my past experience with all kinds of issues, I have not been able to find adequate help for my issue online so far.
Thanks!
Upvotes: 2
Views: 1186
Reputation: 645
Do you know NER? It means named entity recognition. You can preprocess your text and locate all named entities, which you then exclude from stemming. After stemming, you can merge the data again.
Upvotes: 2