Reputation: 107
I am trying to get tf-idf values for Japanese words. The problem is that sklearn's TfidfVectorizer removes some Japanese characters that I want to keep, treating them as stop words.
Here is an example:
from sklearn.feature_extraction.text import TfidfVectorizer
tf = TfidfVectorizer(stop_words = None)
words_list = ["歯","が","痛い"]
tfidf_matrix = tf.fit_transform(words_list)
feature_names = tf.get_feature_names()
print(feature_names)
The output is: ['痛い']
However, I want to keep all three characters in the list. I believe TfidfVectorizer removes single-character tokens as stop words. How can I disable this default behavior and keep all the characters?
Upvotes: 4
Views: 1105
Reputation: 214927
You can change the token_pattern parameter from its default, (?u)\b\w\w+\b, to (?u)\b\w\w*\b. The default matches only tokens with two or more word characters (in case you are not familiar with regex, + means one or more, so \w\w+ matches words with two or more word characters; * means zero or more, so \w\w* matches words with one or more word characters):
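To see the difference between the two patterns in isolation, you can test them directly with Python's re module (a standalone check, separate from sklearn):

```python
import re

text = "歯 が 痛い"

# Default pattern: requires at least two word characters per token.
print(re.findall(r"(?u)\b\w\w+\b", text))   # ['痛い']

# Modified pattern: also allows single-character tokens.
print(re.findall(r"(?u)\b\w\w*\b", text))   # ['歯', 'が', '痛い']
```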
from sklearn.feature_extraction.text import TfidfVectorizer
tf = TfidfVectorizer(stop_words=None, token_pattern=r'(?u)\b\w\w*\b')
words_list = ["歯", "が", "痛い"]
tfidf_matrix = tf.fit_transform(words_list)
# Note: in scikit-learn >= 1.2, use tf.get_feature_names_out() instead.
feature_names = tf.get_feature_names()
print(feature_names)
# ['が', '歯', '痛い']
Upvotes: 4