Can I use TfidfVectorizer in scikit-learn for non-English language? Also how do I read a non-English text in Python?

I have to read a text document in Python that contains both English and non-English (specifically Malayalam) text. This is what I see:

>>> text_english = 'Today is a good day'
>>> text_non_english = 'ആരാണു സന്തോഷമാഗ്രഹിക്കാത്തത'

Now, if I extract the first letter using

>>> print(text_english[0])
T

it works, but when I run

>>> print(text_non_english[0])
�

To get the first letter, I have to write the following:

>>> print(text_non_english[0:3])
ആ

Why does this happen? My aim is to extract the words in the text so that I can pass them to the TF-IDF transformer. When I build the TF-IDF vocabulary from the Malayalam text, it contains two-letter entries that are not real words; they are fragments of full words. What should I do so that the transformer uses whole Malayalam words instead of two-letter fragments?

I used the following code:

>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> useful_text_1[1:3] # contains both English and Malayalam text

>>> vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')

# Learn vocabulary and idf, return term-document matrix
>>> vect_2 = vectorizer.fit_transform(useful_text_1[1:3])
>>> vectorizer.vocabulary_

Some of the entries in the vocabulary are:

ഷമ
സന
സഹ
ർക
ർത

The vocabulary is not correct: it contains fragments instead of whole words. How can I fix this?

Upvotes: 1

Answers (3)

crosslingual

Reputation: 21

An alternative is to try Text2Text to get the TF-IDF vectors. It supports hundreds of languages, including Malayalam.

import text2text as t2t

t2t.Handler([
  'Today is a good day',
  'ആരാണു സന്തോഷമാഗ്രഹിക്കാത്തത'
]).tfidf()

Upvotes: 1

Curie

Reputation: 21

Using a dummy tokenizer that simply splits on whitespace worked for me:

vectorizer = TfidfVectorizer(tokenizer=lambda x: x.split(), min_df=1)

>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> tn = 'ആരാണു സന്തോഷമാഗ്രഹിക്കാത്തത'
>>> vectorizer = TfidfVectorizer(tokenizer=lambda x: x.split(), min_df=1)
>>> vect_2 = vectorizer.fit_transform(tn.split())
>>> for x in vectorizer.vocabulary_:
...     print(x)
... 
സന്തോഷമാഗ്രഹിക്കാത്തത
ആരാണു
>>>
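To see why the default settings produce fragments, here is a minimal sketch that uses only the standard `re` module. It assumes Python 3 and applies the default `token_pattern` documented for scikit-learn's text vectorizers, `(?u)\b\w\w+\b`, directly to the sample string; the variable names are only for illustration.

```python
import re

# Default token_pattern documented for scikit-learn's text vectorizers.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

text = "ആരാണു സന്തോഷമാഗ്രഹിക്കാത്തത"

# Vowel signs and the virama are combining marks, which \w does not match,
# so the default pattern breaks each word at those characters.
default_tokens = re.findall(DEFAULT_TOKEN_PATTERN, text)

# Splitting on whitespace keeps every word intact.
whitespace_tokens = text.split()

print(default_tokens)
print(whitespace_tokens)
```

The first list holds only short fragments, while the second holds the two full words. That is probably why passing a whitespace tokenizer to TfidfVectorizer gives the expected vocabulary.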

Upvotes: 2

Peter

Reputation: 1253

You have to decode the text from UTF-8. Each Malayalam letter takes 3 bytes in UTF-8, so indexing a byte string returns only part of a letter. In Python 2, convert it with the unicode function:

In[36]: tn = 'ആരാണു സന്തോഷമാഗ്രഹിക്കാത്തത'
In[37]: tne=unicode(tn, encoding='utf-8')
In[38]: print(tne[0])
ആ
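For Python 3, where `str` is already Unicode and the `unicode` function no longer exists, a rough equivalent might look like the sketch below. It assumes the file is stored as UTF-8; the temporary file stands in for the real document and is purely illustrative.

```python
import os
import tempfile

text = "ആരാണു സന്തോഷമാഗ്രഹിക്കാത്തത"

# UTF-8 stores each Malayalam code point in three bytes.
raw = text.encode("utf-8")
first_three_bytes = raw[0:3]

# Write the bytes to a temporary file, then read them back as text.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as handle:
    handle.write(raw)
    path = handle.name

try:
    # Passing the encoding explicitly makes open() return decoded text.
    with open(path, encoding="utf-8") as source:
        decoded = source.read()
finally:
    os.remove(path)

print(len(raw), len(decoded))
print(first_three_bytes.decode("utf-8"), decoded[0])
```

Indexing the decoded string returns whole characters, so `decoded[0]` gives the first letter without a three-byte slice.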

Upvotes: 2
