user1506145

Reputation: 5296

Special characters in countVectorizer Scikit-learn

Consider this runnable example:

#coding: utf-8
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
corpus = ['öåa hej ho', 'åter aba na', 'äs äp äl']
x = vectorizer.fit_transform(corpus)
l = vectorizer.get_feature_names()

for u in l:
    print u

The output will be

aba
hej
ho
na
ter

Why are the characters å, ä and ö removed? Note that strip_accents=None is the vectorizer's default. I would be really grateful if you could help me with this.

Upvotes: 6

Views: 7360

Answers (1)

ogrisel

Reputation: 40169

This is intentional: stripping accents reduces the dimensionality and makes the vectorizer tolerant of input whose authors do not use accented characters consistently.

If you want to disable that feature, just pass strip_accents=None to CountVectorizer as explained in the documentation of this class.

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> CountVectorizer(strip_accents='ascii').build_analyzer()(u'\xe9t\xe9')
[u'ete']
>>> CountVectorizer(strip_accents=False).build_analyzer()(u'\xe9t\xe9')
[u'\xe9t\xe9']
>>> CountVectorizer(strip_accents=None).build_analyzer()(u'\xe9t\xe9')
[u'\xe9t\xe9']
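To make the two behaviours above easier to compare, here is a minimal standard-library sketch, assuming Python 3 and text (not byte) strings. It is not scikit-learn's code: the pattern is the documented default token_pattern of CountVectorizer, while strip_accents and analyze are hypothetical helpers that only approximate what the analyzer does.

```python
# -*- coding: utf-8 -*-
import re
import unicodedata

# Same regular expression as CountVectorizer's documented default token_pattern.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def strip_accents(text):
    """Hypothetical helper: decompose characters (NFKD), drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def analyze(text, strip=False):
    """Hypothetical helper: lowercase, optionally strip accents, then tokenize."""
    text = text.lower()
    if strip:
        text = strip_accents(text)
    return TOKEN_PATTERN.findall(text)


corpus = ["öåa hej ho", "åter aba na", "äs äp äl"]

# Vocabulary with accents kept, then with accents stripped.
kept = sorted({tok for doc in corpus for tok in analyze(doc)})
stripped = sorted({tok for doc in corpus for tok in analyze(doc, strip=True)})

print(kept)
print(stripped)
```

With accents kept, å, ä and ö are treated as ordinary word characters, so tokens such as åter survive whole. With accents stripped, åter becomes ater, which would share a feature with an unaccented spelling of the same word.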

Upvotes: 11
