user2534517

Reputation: 41

Python: Unicode encoding of returned String from parsed Query (MeCab)

I am trying to use a program called MeCab, which does morphological analysis of Japanese text. The problem is that it returns a byte string, and when I print it, almost every character comes out as a question mark. However, if I try to decode it, it throws an error. Here is my code:

#!/usr/bin/python
# -*- coding:utf-8 -*-

import MeCab
tagger = MeCab.Tagger("-Owakati")
text = 'MeCabで遊んでみよう!'

print text
result = tagger.parse(text)
print result

result = unicode(result, 'utf-8')
print result

This is my output:

MeCabで遊んでみよう!
MeCab �� �� ��んで�� �� ��う! 

Traceback (most recent call last):
  File "test.py", line 12, in <module>
    result = unicode(result, 'utf-8')
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 6-7: invalid continuation byte

------------------
(program exited with code: 1)
Press return to continue

Also, my terminal can display Japanese characters properly. For example, print '日本語' works perfectly fine.

Any ideas?

Upvotes: 2

Views: 1176

Answers (2)

Mark Tolonen

Reputation: 178021

MeCab doesn't return UTF-8 by default. Below is a quote from the following link (via Google Translate):

http://mecab.googlecode.com/svn/trunk/mecab/doc/index.html#charset

Unless otherwise specified, EUC is used. If you would like to use UTF-8 or Shift-JIS, change the charset with the dictionary's configure option and rebuild the dictionary. This creates a Shift-JIS or UTF-8 dictionary.

Try result = tagger.parse(text).decode('euc-jp').
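
To see why the UTF-8 decode in the question fails, here is a minimal sketch that needs no MeCab install. It assumes the dictionary really is EUC-JP (the documented default quoted above). The variable names are only illustrative, and the bytes are produced with Python's own codec rather than by MeCab.

```python
# -*- coding: utf-8 -*-
# Sketch only: simulates EUC-JP output with Python's codec; MeCab is not called.
text = u'MeCabで遊んでみよう!'

# Assumption: this is the kind of byte string an EUC-JP dictionary hands back.
euc_bytes = text.encode('euc-jp')

try:
    euc_bytes.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    # Same class of error as in the question's traceback.
    utf8_ok = False

# Decoding with the matching codec recovers the original text.
recovered = euc_bytes.decode('euc-jp')
```

Presumably the input has to be in the dictionary's charset as well, so in Python 2 the full call might look like tagger.parse(text.decode('utf-8').encode('euc-jp')).decode('euc-jp'). That line is untested against any particular MeCab build.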

Upvotes: 1

Roman Bodnarchuk

Reputation: 29737

Looks like your assumption that MeCab returns a UTF-8 string is wrong. So, in your conversion to unicode you have to use some other encoding (e.g. iso2022_jp; the exact choice of encoding depends on MeCab innards).
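
If the dictionary's charset is unknown, one rough way to narrow it down is to try a few candidate codecs in strict mode. The sketch below is not part of MeCab: `first_decodable` and its candidate list are hypothetical, and a decode that succeeds is only a hint, since some byte sequences are valid in more than one encoding.

```python
# -*- coding: utf-8 -*-
def first_decodable(raw, candidates=('utf-8', 'euc-jp', 'shift_jis')):
    """Return (encoding, text) for the first codec that decodes raw cleanly.

    Hypothetical helper; returns (None, None) if no candidate fits.
    """
    for name in candidates:
        try:
            return name, raw.decode(name)
        except UnicodeDecodeError:
            continue
    return None, None


# Stand-in for tagger.parse(...) output, assuming an EUC-JP dictionary.
sample = u'遊んでみよう'.encode('euc-jp')
encoding, decoded = first_decodable(sample)
```

Checking the charset the dictionary was built with is more reliable than guessing from the bytes, so treat this as a diagnostic aid only.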

Upvotes: 0
