Reputation: 2907
I have a list of songs, something like:
list2 = ["first song", "second song", "third song"...]
Here is my code:
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords
vectorizer = CountVectorizer(stop_words=stopwords.words('english'))
bagOfWords = vectorizer.fit_transform(list2)
This works, but I also want to stem the words.
I tried to do it this way:
def tokeni(self, data):
    return [SnowballStemmer("english").stem(word) for word in data.split()]

vectorizer = CountVectorizer(stop_words=stopwords.words('english'),
                             tokenizer=self.tokeni)
but it didn't work. What am I doing wrong?
Update: with the tokenizer I get tokens like "oh...", "s-like..." and "knees,", whereas without it I don't get any tokens containing dots, commas, etc.
Upvotes: 0
Views: 1536
Reputation: 17629
You can pass a custom preprocessor instead, which should work just as well while retaining the functionality of the default tokenizer:
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem import SnowballStemmer

list2 = ["rain", "raining", "rainy", "rainful", "rains", "raining!", "rain?"]
stemmer = SnowballStemmer("english")  # build the stemmer once, not once per word

def preprocessor(data):
    # Stem each whitespace-separated word; the default tokenizer strips punctuation afterwards
    return " ".join(stemmer.stem(word) for word in data.split())

vectorizer = CountVectorizer(preprocessor=preprocessor).fit(list2)
print(vectorizer.vocabulary_)
# Should contain these entries (key order may vary):
# {'raining': 2, 'raini': 1, 'rain': 0}
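
Note the 'raining' entry in that output: it comes from "raining!", which the stemmer sees with the "!" still attached, so it is left unstemmed. The same thing explains the "oh..." and "knees," tokens in the question: data.split() only splits on whitespace, while CountVectorizer's default token_pattern, (?u)\b\w\w+\b, keeps only runs of two or more word characters. A minimal sketch of the difference, using only the re module; toy_stem is a hypothetical stand-in for SnowballStemmer("english").stem so the snippet runs without NLTK, and stemming_tokenizer is an illustrative name, not part of scikit-learn:

```python
import re

# Default token_pattern documented for CountVectorizer: runs of 2+ word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def stemming_tokenizer(text, stem):
    """Tokenize with the regex first, then apply `stem` to each clean token."""
    return [stem(token) for token in TOKEN_PATTERN.findall(text)]


def toy_stem(word):
    # Hypothetical stand-in for a real stemmer: strips a trailing "ing" or "s".
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


text = "oh... s-like... knees, raining!"

split_tokens = text.split()                    # whitespace only, punctuation stays attached
regex_tokens = TOKEN_PATTERN.findall(text)     # punctuation and 1-character tokens dropped
stemmed_tokens = stemming_tokenizer(text, toy_stem)

print(split_tokens)    # ['oh...', 's-like...', 'knees,', 'raining!']
print(regex_tokens)    # ['oh', 'like', 'knees', 'raining']
print(stemmed_tokens)  # ['oh', 'like', 'knee', 'rain']
```

With a real stemmer, something like tokenizer=lambda doc: stemming_tokenizer(doc, stemmer.stem) could be passed to CountVectorizer; token_pattern is ignored once a tokenizer is supplied, which is why the regex is applied by hand here. If stop_words is also set, keep in mind that stop-word filtering runs on the already stemmed tokens, so stop words whose stem differs from the original form may slip through.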
Upvotes: 2