Israel Zinc

Reputation: 2769

Parsing Japanese with Python

*****EDITED WITH THE FULL CODE******

I am trying to parse some Japanese text using Python (version 3.5.3) and the MeCab library on macOS.

I have a txt file with the following text:

石の上に三年

I set the preferences in TextEdit to save using UTF-8, so I believe the system is correctly saving the file in UTF-8.
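
A quick way to double-check the file itself is to look at its raw bytes (just a sketch; 石 encoded as UTF-8 should start with the bytes e7 9f b3):

with open("simple_japanese.txt", "rb") as f:
    raw = f.read()
print(raw[:12])             # UTF-8 Japanese looks like b'\xe7\x9f\xb3\xe3\x81\xae...'
print(raw.decode("utf-8"))  # raises UnicodeDecodeError if the file is not valid UTF-8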

I got the following error:

Traceback (most recent call last):
  File "japanese.py", line 29, in <module>
    words = extractMetadataFromTXT(fileName)
  File "japanese.py", line 14, in extractMetadataFromTXT
    md = extractWordsJP(data)
  File "japanese.py", line 22, in extractWordsJP
    components.append(parsed.surface)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 0: invalid start byte

Below is my full code. Nothing is missing.

import MeCab
import nltk
from nltk import *
from nltk.corpus import knbc

mt = MeCab.Tagger("-d /usr/local/lib/mecab/dic/mecab-ipadic-neologd")
wordsList = knbc.words()
fdist = nltk.FreqDist(w.lower() for w in wordsList)

def extractMetadataFromTXT(filePath):
    with open(filePath, 'r', encoding='utf-8') as f:
        data = f.read()
        print(data)
    md = extractWordsJP(data)
    print(md)
    return md

def extractWordsJP(wordsJP):
    components = []
    parsed = mt.parseToNode(wordsJP)
    while parsed:
        components.append(parsed.surface)
        parsed = parsed.next
    return components

if __name__ == "__main__":
    fileName = "simple_japanese.txt"
    words = extractMetadataFromTXT(fileName)
    print(words)

Does anyone have any clue why I am getting this error message?

Fun fact: sometimes it works. :O

Thanks in advance,

Israel

Upvotes: 0

Views: 2029

Answers (3)

Israel Zinc

Reputation: 2769

Solution:

Apparently, the problem was with MeCab, not with the Python code itself. When you install MeCab from scratch using make, it sometimes doesn't install properly, but it doesn't raise any error.

I am not sure why, but if you want to dig further and find out exactly what is happening, that would be great. I only know that I uninstalled it and re-installed it using brew, and it worked.

The same thing happened on other Macs in the office. I am using brew on macOS, so here is the command I used to install it properly:

brew install mecab mecab-ipadic git curl xz

Also, to install it on Linux, use the following commands:

sudo apt-get install mecab libmecab-dev mecab-ipadic
sudo apt-get install mecab-ipadic-utf8
sudo apt-get install python-mecab
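
After re-installing, a quick sanity check (just a sketch, using the default dictionary) is to parse a short string and make sure the output is readable instead of mojibake:

import MeCab

tagger = MeCab.Tagger()   # add "-d <dictionary path>" if you use a custom dictionary like neologd
print(tagger.parse("石の上に三年"))   # should print readable, tab-separated morpheme lines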

Hope this helps future people trying to tag Japanese words.

Upvotes: 1

Yann Vernier

Reputation: 15887

The error is happening because you're feeding something that isn't valid UTF-8 into a UTF-8 decoder. This could be caused by splitting bytes rather than characters, or perhaps by incorrectly decoding another encoding such as JIS or EUC as if it were UTF-8. In Python it's generally sound to stick to Unicode strings, although your system might decode text files differently if something has set the locale parameters. Even when you do have proper Unicode strings, splitting is non-trivial because some code points modify others, such as accents. Luckily, Japanese doesn't have much of that sort of thing (unless someone happens to encode ぽ as ほ plus a combining ring, etc.).
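
Both failure modes are easy to reproduce in isolation (a minimal sketch, not your code):

import unicodedata

# Splitting bytes rather than characters breaks UTF-8:
data = "石の上に三年".encode("utf-8")   # 6 characters, 18 bytes
try:
    data[:4].decode("utf-8")            # cuts the second character in half
except UnicodeDecodeError as e:
    print(e)                            # ... unexpected end of data

# Combining marks: ほ followed by U+309A normalizes to the single character ぽ:
print(unicodedata.normalize("NFC", "ほ\u309a"))   # ぽ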

One potential issue: MeCab's web page states (per Google Translate) that "unless otherwise specified, EUC is used." If MeCab is splitting words under the assumption that it is reading EUC, it will mangle UTF-8.
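
For what it's worth, feeding non-UTF-8 bytes to the UTF-8 decoder reproduces exactly the kind of error in your traceback: 0xb0 is a UTF-8 continuation byte, so it can never start a character (sketch):

try:
    b"\xb0".decode("utf-8")
except UnicodeDecodeError as e:
    print(e)   # 'utf-8' codec can't decode byte 0xb0 in position 0: invalid start byte

# EUC-JP encoded Japanese generally fails to decode as UTF-8 in a similar way:
try:
    print("石の上に三年".encode("euc-jp").decode("utf-8"))
except UnicodeDecodeError as e:
    print(e)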

Upvotes: 1

Jose Haro Peralta

Reputation: 999

When you open the file, specify the encoding:

with open(file, 'r', encoding='utf-8') as f:
    data = f.read()

...

BTW, when opening the file, use a context manager as shown in this example.

Upvotes: -1
