Is CountVectorizer in sklearn only meant for English?

Question

I am trying to apply CountVectorizer to Telugu and Hindi text, both of which are Indic languages, but the vectorizer appears to be stemming the words automatically.

from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer()
xv = count_vect.fit_transform(['she is a good girl','वो बहुत सुन्दर है','ఇది చాలా లాడిష్ మరియు బాల్య టీనేజ్ కుర్రాళ్ళు మాత్రమే దీనిని ఫన్నీగా చూడవచ్చు', 'దోపిడీ మరియు ఎక్కువగా లోతు లేదా అధునాతనత లేని నేరాలకు సంబంధించిన గ్రాఫిక్ చికిత్సను చూడటం భరించదగినది'])
list(count_vect.get_feature_names_out())  # get_feature_names() on scikit-learn < 1.0

The output is as follows:

['girl',
 'good',
 'is',
 'she',
 'दर',
 'बह',
 'అధ',
 'ఇద',
 'ఎక',
 'చదగ',
 'డట',
 'డవచ',
 'తనత',
 'నద',
 'ఫన',
 'భర',
 'మర',
 'రమ',
 'లక',
 'వగ',
 'సన']

It is clearly evident that it is stemming the Telugu and Hindi words automatically. Is there any way to avoid that?

XavierBrt · Accepted Answer

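The default analyzer used by CountVectorizer() handles some scripts poorly. It tokenizes with a regular expression from Python's re module, whose \w does not match the combining marks (vowel signs and the virama) used in Devanagari and Telugu, so words are cut at those characters. This is truncation during tokenization rather than stemming. You can define a custom analyzer that controls how the text is separated into words. To separate the words properly, you can use a regex: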

import regex
from sklearn.feature_extraction.text import CountVectorizer

def custom_analyzer(text):
    # extract words of at least 2 word characters
    words = regex.findall(r'\w{2,}', text)
    for w in words:
        yield w

count_vect = CountVectorizer(analyzer=custom_analyzer)
xv = count_vect.fit_transform(['she is a good girl','वो बहुत सुन्दर है','ఇది చాలా లాడిష్ మరియు బాల్య టీనేజ్ కుర్రాళ్ళు మాత్రమే దీనిని ఫన్నీగా చూడవచ్చు', 'దోపిడీ మరియు ఎక్కువగా లోతు లేదా అధునాతనత లేని నేరాలకు సంబంధించిన గ్రాఫిక్ చికిత్సను చూడటం భరించదగినది'])
count_vect.get_feature_names_out()  # get_feature_names() on scikit-learn < 1.0
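
If installing an extra package is not an option, a similar analyzer can probably be written with the standard library alone. The following is only a sketch, not part of the original answer: the name mark_aware_analyzer and its min_len parameter are made up for illustration, and it assumes that treating Unicode letters, marks, numbers and the zero-width joiners as word characters is good enough for your text.

```python
import unicodedata

# Zero-width non-joiner / joiner, which can appear inside Indic words
_JOINERS = "\u200c\u200d"

def mark_aware_analyzer(text, min_len=2):
    """Split text into tokens, keeping combining marks inside words.

    Hypothetical helper: min_len counts code points, not visible characters.
    """
    tokens, current = [], []
    for ch in text:
        # General categories starting with L (letter), M (mark) or N (number)
        # are treated as part of a word; everything else ends the token.
        if unicodedata.category(ch)[0] in "LMN" or ch in _JOINERS:
            current.append(ch)
        else:
            if len(current) >= min_len:
                tokens.append("".join(current))
            current = []
    if len(current) >= min_len:
        tokens.append("".join(current))
    return tokens

print(mark_aware_analyzer("वो बहुत सुन्दर है"))
```

A function like this can be passed in the same way as above, i.e. CountVectorizer(analyzer=mark_aware_analyzer). Like the regex version, it does no lowercasing or stop-word removal.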

I used the regex module because its Unicode support is better than that of the re module: its \w handles the combining marks in these scripts (thanks to this answer for explaining).
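
To see the difference for yourself, here is a small check using only the standard library. It applies the default token_pattern of CountVectorizer, r"(?u)\b\w\w+\b", to the Hindi sentence from the question and prints the Unicode category of each character in one word. The variable names are just for illustration.

```python
import re
import unicodedata

# Default token_pattern of CountVectorizer: two or more \w characters
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"

word = "बहुत"
for ch in word:
    # Print code point, Unicode general category, and whether re's \w matches
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), bool(re.match(r"\w", ch)))

# The vowel sign (category Mn) is not matched by \w, so the word is cut there
tokens = re.findall(DEFAULT_PATTERN, "वो बहुत सुन्दर है")
print(tokens)
```

The vowel sign in the middle of the word has category Mn (a nonspacing mark) and is not matched by \w in re, so the pattern only finds the fragments that appear in the question's output, such as 'बह' and 'दर'.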
