Reputation: 2825
I am trying to use spaCy's Japanese tokenizer.
import spacy
Question = 'すぺいんへ いきました。'
nlp(Question.decode('utf8'))
I am getting the error below:
TypeError: Expected unicode, got spacy.tokens.token.Token
Any ideas on how to fix this?
Thanks!
Upvotes: 6
Views: 5336
Reputation: 15633
I am not sure why you got that particular bug, but Japanese support has improved since you posted this question, and it should work with the latest version of spaCy. For Japanese support you'll also need to install MeCab and some other dependencies yourself; see here for a detailed guide.
Working code would look like this:
import spacy
ja = spacy.blank('ja')
print(ja('日本語ですよ'))
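Once the blank Japanese pipeline loads, you can iterate over the resulting Doc as usual. A minimal sketch, assuming the Japanese dependencies (MeCab etc.) are installed:

import spacy

# blank Japanese pipeline; requires the Japanese tokenizer dependencies
ja = spacy.blank('ja')

doc = ja('日本語ですよ')
for token in doc:
    print(token.text)  # one tokenized word per line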
If you still have trouble please feel free to file an issue on Github.
Upvotes: 1
Reputation: 1580
Try using this:
import spacy
question = u'すぺいんへ いきました。'
nlp(question)
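Note that this snippet assumes nlp has already been created. A self-contained sketch, assuming a blank Japanese pipeline as in the other answer:

import spacy

nlp = spacy.blank('ja')  # assumed here; requires the Japanese dependencies
question = u'すぺいんへ いきました。'
print([t.text for t in nlp(question)])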
Upvotes: 2
Reputation: 670
According to spaCy, tokenization for Japanese is still in an alpha phase. The ideal way to tokenize is to provide a tokenized word list along with information about the language's structure. For example, for an English sentence you can try this:
import spacy
nlp = spacy.load("en") # execute "python -m spacy download en" before this on standard console
sentence = "Writing some answer on stackoverflow, as an example for spacy language model"
print(["::".join((word.orth_, word.pos_)) for word in nlp(sentence)])
## <OUTPUT>
## ['Writing::VERB', 'some::DET', 'answer::NOUN', 'on::ADP', 'stackoverflow::NOUN', ',::PUNCT', 'as::ADP', 'an::DET', 'example::NOUN', 'for::ADP', 'spacy::ADJ', 'language::NOUN', 'model::NOUN']
Such results are currently not available for Japanese.
If you use python -m spacy download xx and nlp = spacy.load("xx"), it does its best to recognize named entities.
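A sketch of what that looks like, assuming the xx model has been downloaded (the exact model name can differ between spaCy versions):

import spacy

# multi-language model; download first with: python -m spacy download xx
nlp = spacy.load("xx")
doc = nlp("日本はアジアの国です。")
print([(ent.text, ent.label_) for ent in doc.ents])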
Also, if you look at the spaCy source here, you will see that tokenization is available, but it only exposes a make_doc function, which is quite naive.
Note: the pip version of spacy still contains older code; the GitHub link above is somewhat closer to the latest code.
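For illustration, make_doc can be called directly; a minimal sketch, again assuming the Japanese tokenizer dependencies are installed:

import spacy

ja = spacy.blank('ja')
doc = ja.make_doc('すぺいんへ いきました。')  # tokenization only, no tagging or parsing
print([t.text for t in doc])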
So for tokenization, it is highly suggested as of now to use janome. An example is given below:
from janome.tokenizer import Tokenizer as janome_tokenizer
sentence = "日本人のものと見られる、延べ2億件のメールアドレスとパスワードが闇サイトで販売されていたことがわかりました。過去に漏えいしたデータを集めたものと見られ、調査に当たったセキュリティー企業は、日本を狙ったサイバー攻撃のきっかけになるおそれがあるとして注意を呼びかけています。"
token_object = janome_tokenizer()
[x.surface for x in token_object.tokenize(sentence)]
## <OUTPUT> ##
## ['日本人', 'の', 'もの', 'と', '見', 'られる', '、', '延べ', '2', '億', '件', 'の', 'メールアドレス', 'と', 'パスワード', 'が', '闇', 'サイト', 'で', '販売', 'さ', 'れ', 'て', 'い', 'た', 'こと', 'が', 'わかり', 'まし', 'た', '。', '過去', 'に', '漏えい', 'し', 'た', 'データ', 'を', '集め', 'た', 'もの', 'と', '見', 'られ', '、', '調査', 'に', '当たっ', 'た', 'セキュリティー', '企業', 'は', '、', '日本', 'を', '狙っ', 'た', 'サイバー', '攻撃', 'の', 'きっかけ', 'に', 'なる', 'お', 'それ', 'が', 'ある', 'として', '注意', 'を', '呼びかけ', 'て', 'い', 'ます', '。']
## you can look at
## for x in token_object.tokenize(sentence):
## print(x)
## <OUTPUT LIKE>:
## 日本人 名詞,一般,*,*,*,*,日本人,ニッポンジン,ニッポンジン
## の 助詞,連体化,*,*,*,*,の,ノ,ノ
## もの 名詞,非自立,一般,*,*,*,もの,モノ,モノ
## と 助詞,格助詞,引用,*,*,*,と,ト,ト
## ....
## <OUTPUT Truncated>
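To get output analogous to the word::POS pairs shown above for English, you can pair each token's surface form with the first field of its part-of-speech string; a sketch assuming janome's Token exposes surface and part_of_speech attributes:

from janome.tokenizer import Tokenizer as janome_tokenizer

token_object = janome_tokenizer()
# prints surface::POS pairs, e.g. '日本人::名詞'
print(["::".join((x.surface, x.part_of_speech.split(',')[0]))
       for x in token_object.tokenize("日本人のものと見られる。")])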
I think the spaCy team is working on similar output to build models for the Japanese language, so that "language specific" constructs can be made for Japanese as well, similar to those for other languages.
Update
After writing this, out of curiosity, I started searching around. Please check udpipe here, here & here. It seems udpipe supports more than 50 languages, and it provides a solution to the language-support problem we see in spaCy.
Upvotes: 4