Reputation: 41
I am trying to use a program called MeCab, which does syntax analysis on Japanese text. The problem I am having is that it returns a byte string and if I try to print it, it prints question marks for almost all characters. However, if I try to use .decode, it throws an error. Here is my code:
#!/usr/bin/python
# -*- coding:utf-8 -*-
import MeCab
tagger = MeCab.Tagger("-Owakati")
text = 'MeCabで遊んでみよう!'
print text
result = tagger.parse(text)
print result
result = unicode(result, 'utf-8')
print result
This is my output:
MeCabで遊んでみよう!
MeCab �� �� ��んで�� �� ��う!
Traceback (most recent call last):
  File "test.py", line 12, in <module>
    result = unicode(result, 'utf-8')
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 6-7: invalid continuation byte
------------------
(program exited with code: 1)
Press return to continue
Also, my terminal is able to display Japanese characters properly. For example, print '日本語' works perfectly fine.
Any ideas?
Upvotes: 2
Views: 1176
Reputation: 178021
MeCab doesn't return UTF-8 by default. Below is a quote from the following link (via Google Translate):
http://mecab.googlecode.com/svn/trunk/mecab/doc/index.html#charset
Unless otherwise specified, euc is used. If you would like to use utf8 or shift-jis, change the charset with the dictionary's configure options and rebuild the dictionary. This creates a shift-jis or utf8 dictionary.
Try result = tagger.parse(text).decode('euc-jp').
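To see why the UTF-8 decode fails, here is a minimal sketch that does not need MeCab at all. It assumes the dictionary charset really is EUC-JP and that the binding takes and returns byte strings, as the Python 2 one in the question appears to. The names parse_with_charset and fake_parse are made up for illustration; fake_parse is only a stand-in for tagger.parse. If the dictionary is EUC-JP, the input probably needs to be encoded to EUC-JP before parsing as well, which is what the helper does.

```python
# -*- coding: utf-8 -*-

def parse_with_charset(parse, text, charset='euc-jp'):
    """Encode text to the dictionary charset, parse it, decode the result.

    `parse` is any callable that takes and returns a byte string
    (assumed here to behave like tagger.parse in the question).
    """
    raw = parse(text.encode(charset))
    return raw.decode(charset)


def fake_parse(data):
    # Hypothetical stand-in for tagger.parse: returns its input unchanged.
    return data


text = u'MeCabで遊んでみよう!'
euc_bytes = text.encode('euc-jp')

# EUC-JP bytes are not valid UTF-8, so this decode raises.
try:
    euc_bytes.decode('utf-8')
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# Decoding with the matching charset recovers the original text.
round_tripped = parse_with_charset(fake_parse, text)
print(utf8_failed)
print(round_tripped == text)
```

With the real tagger you would pass tagger.parse instead of fake_parse, and change charset if your dictionary was built with a different one.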
Upvotes: 1
Reputation: 29737
It looks like your assumption that MeCab returns a UTF-8 string is wrong. So, in your conversion to unicode, you have to use some other encoding (e.g. iso2022_jp; the exact choice of encoding depends on MeCab's innards).
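If you don't know which encoding your build uses, one rough way to find out is to try a few candidates in turn. This is only a sketch: guess_decode and the candidate list are my own invention, not anything MeCab provides, and a decode that succeeds does not prove the encoding is the right one (some byte sequences are valid in more than one Japanese encoding), so check the result by eye.

```python
# -*- coding: utf-8 -*-

# Candidate encodings to try, in order (an assumed list, adjust as needed).
CANDIDATES = ('utf-8', 'euc-jp', 'shift_jis', 'iso2022_jp')


def guess_decode(data, candidates=CANDIDATES):
    """Return (encoding, text) for the first candidate that decodes cleanly."""
    for name in candidates:
        try:
            return name, data.decode(name)
        except UnicodeDecodeError:
            continue
    raise ValueError('none of the candidate encodings matched')


# Stand-in for the byte string returned by tagger.parse.
sample = u'MeCabで遊んでみよう!'
name, decoded = guess_decode(sample.encode('euc-jp'))
print(name)
```

Checking the dictionary's configured charset is more reliable than guessing, so treat this as a diagnostic aid only.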
Upvotes: 0