Reputation: 4485
I need to read a text document in Python that contains both English and non-English (specifically Malayalam) text. This is what I see:
>>> text_english = 'Today is a good day'
>>> text_non_english = 'ആരാണു സന്തോഷമാഗ്രഹിക്കാത്തത'
Now, if I extract the first letter with
>>> print(text_english[0])
T
but when I run
>>> print(text_non_english[0])
�
To get the first letter, I have to write the following:
>>> print(text_non_english[0:3])
ആ
Why does this happen? My aim is to extract the words from the text so that I can feed them to the tfidf transformer. When I build the tfidf vocabulary from the Malayalam text, it contains two-letter entries that are not real words; they are fragments of full words. What should I do so that the tfidf transformer uses the full Malayalam words instead of two-letter fragments?
I used the following code for this:
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> useful_text_1[1:3]  # contains both English and Malayalam text
>>> vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')
>>> # Learn vocabulary and idf, return term-document matrix
>>> vect_2 = vectorizer.fit_transform(useful_text_1[1:3])
>>> vectorizer.vocabulary_
Some of the words in the vocabulary are as below:
ഷമ
സന
സഹ
ർക
ർത
The vocabulary is not correct: it does not contain the whole words. How can I fix this?
Upvotes: 1
Views: 3507
Reputation: 21
An alternative is to try Text2Text to get the TF-IDF vectors. It supports hundreds of languages, including Malayalam.
import text2text as t2t
t2t.Handler([
'Today is a good day',
'ആരാണു സന്തോഷമാഗ്രഹിക്കാത്തത'
]).tfidf()
Upvotes: 1
Reputation: 21
Using a dummy tokenizer actually worked for me:
vectorizer = TfidfVectorizer(tokenizer=lambda x: x.split(), min_df=1)
>>> tn = 'ആരാണു സന്തോഷമാഗ്രഹിക്കാത്തത'
>>> vectorizer = TfidfVectorizer(tokenizer=lambda x: x.split(),min_df=1)
>>> vect_2 = vectorizer.fit_transform(tn.split())
>>> for x in vectorizer.vocabulary_:
...     print x
...
സന്തോഷമാഗ്രഹിക്കാത്തത
ആരാണു
>>>
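A likely reason the whitespace tokenizer helps, sketched here in Python 3 with only the standard library: the default token_pattern of TfidfVectorizer is `(?u)\b\w\w+\b`, and in Python's re module `\w` does not match Malayalam vowel signs or the virama (they are combining marks, not letters). The default pattern therefore cuts each word at those marks and keeps only runs of two or more plain letters. The variable names below are illustrative, not part of any library.

```python
import re

text = 'ആരാണു സന്തോഷമാഗ്രഹിക്കാത്തത'

# Default token_pattern used by scikit-learn's TfidfVectorizer.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

# \w does not match combining marks, so each word is cut into fragments.
default_tokens = re.findall(DEFAULT_TOKEN_PATTERN, text)

# Splitting on whitespace keeps every word whole.
whitespace_tokens = text.split()

print(default_tokens)
print(whitespace_tokens)
```

The fragments printed first should look like the two-letter entries in the question's vocabulary, while the whitespace split returns the two full words.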
Upvotes: 2
Reputation: 1253
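Your string is a UTF-8 encoded byte string (this is Python 2), and each Malayalam letter takes 3 bytes in UTF-8, so indexing returns single bytes rather than letters. Decode it with the unicode function so that indexing works on characters: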
In[36]: tn = 'ആരാണു സന്തോഷമാഗ്രഹിക്കാത്തത'
In[37]: tne=unicode(tn, encoding='utf-8')
In[38]: print(tne[0])
ആ
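For comparison, a small Python 3 sketch of the same effect, assuming the text is UTF-8 encoded. In Python 3 a str is already a sequence of characters, so the byte-level behaviour from the question only shows up after an explicit encode. The variable names are illustrative.

```python
text = 'ആരാണു'

# The UTF-8 bytes, which is what a plain Python 2 str holds.
raw = text.encode('utf-8')

# One byte of a three-byte sequence is not a valid character on its own.
first_byte_only = raw[0:1].decode('utf-8', errors='replace')

# The first three bytes together decode to the first letter.
first_three_bytes = raw[0:3].decode('utf-8')

print(len(raw))                      # number of bytes, not letters
print(first_three_bytes == text[0])  # True
```

This is why text_non_english[0:3] printed the first letter in the question: the slice happened to cover exactly one three-byte character.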
Upvotes: 2