Reputation: 107
I am trying to get tf-idf values for Japanese words. The problem is that sklearn's TfidfVectorizer removes some Japanese characters that I want to keep, treating them as stop words.
Here is an example:
from sklearn.feature_extraction.text import TfidfVectorizer
tf = TfidfVectorizer(stop_words = None)
words_list = ["歯","が","痛い"]
tfidf_matrix = tf.fit_transform(words_list)
feature_names = tf.get_feature_names()
print(feature_names)
The output is: ['痛い']
However, I want to keep all three characters in the list. I believe TfidfVectorizer removes single-character tokens as stop words. How can I disable this default behavior and keep all the characters?
Upvotes: 4
Views: 1105
Reputation: 214927
You can change the token_pattern parameter from its default, (?u)\b\w\w+\b, to (?u)\b\w\w*\b. The default matches only tokens with two or more word characters (in case you are not familiar with regex, + means one or more, so \w\w+ matches words with two or more word characters; * means zero or more, so \w\w* matches words with one or more word characters):
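To see the difference between the two patterns in isolation, you can test them directly with Python's re module (a standalone check, separate from sklearn):

```python
import re

text = "歯 が 痛い"

# Default pattern: requires at least two word characters per token.
print(re.findall(r"(?u)\b\w\w+\b", text))   # ['痛い']

# Modified pattern: also allows single-character tokens.
print(re.findall(r"(?u)\b\w\w*\b", text))   # ['歯', 'が', '痛い']
```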
from sklearn.feature_extraction.text import TfidfVectorizer
tf = TfidfVectorizer(stop_words=None, token_pattern=r'(?u)\b\w\w*\b')
words_list = ["歯", "が", "痛い"]
tfidf_matrix = tf.fit_transform(words_list)
# Note: in scikit-learn >= 1.2, use tf.get_feature_names_out() instead.
feature_names = tf.get_feature_names()
print(feature_names)
# ['が', '歯', '痛い']
Upvotes: 4