Daiki Akiyoshi

Reputation: 107

How to deactivate the default stop words feature for sklearn TfidfVectorizer

I am trying to get the tf-idf values for Japanese words. The problem is that sklearn's TfidfVectorizer treats some Japanese characters I want to keep as stop words and removes them.

The following is the example:

from sklearn.feature_extraction.text import TfidfVectorizer
tf = TfidfVectorizer(stop_words = None)

words_list = ["歯","が","痛い"]
tfidf_matrix = tf.fit_transform(words_list)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
feature_names = tf.get_feature_names_out()
print(feature_names)

The output is: ['痛い']

However, I want to keep all three characters in the list. I believe TfidfVectorizer removes single-character tokens as stop words. How can I deactivate the default stop words feature and keep all characters?

Upvotes: 4

Views: 1105

Answers (1)

akuiper

Reputation: 214927

You can change the token_pattern parameter from its default, (?u)\b\w\w+\b, to (?u)\b\w\w*\b. The default pattern only matches tokens with two or more word characters, which is why the single characters are dropped; this is the tokenizer at work, not the stop-word list. (In case you are not familiar with regex: + means one or more, so \w\w+ matches two or more word characters; * means zero or more, so \w\w* matches one or more.)

from sklearn.feature_extraction.text import TfidfVectorizer
tf = TfidfVectorizer(stop_words=None, token_pattern=r'(?u)\b\w\w*\b')

words_list = ["歯", "が", "痛い"]
tfidf_matrix = tf.fit_transform(words_list)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
feature_names = tf.get_feature_names_out().tolist()
print(feature_names)
# ['が', '歯', '痛い']

Upvotes: 4

Related Questions