Reputation: 55
I downloaded a pretrained word vector file (.bin) from Facebook (https://fasttext.cc/docs/en/crawl-vectors.html). However, when I try to load this model, it raises an error:
from gensim.models import FastText
fasttext_model = FastText.load_fasttext_format('cc.ko.300.bin', encoding='utf8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte
The weird thing is that loading works fine when I use the older version of the .bin file (https://fasttext.cc/docs/en/pretrained-vectors.html).
What is wrong with the newer files, and how can I fix it?
I must use the .bin file because I need all the n-grams to handle OOV words, so solutions like 'use the .vec file' won't help.
Thank you so much :)
Upvotes: 0
Views: 1996
Reputation: 1976
It is better to load the fastText word embeddings using the fastText package rather than gensim.
First install the fasttext module for Python using pip install fasttext
Then load the model as follows:
import fasttext
model = fasttext.load_model("path/2/pretrained_fastText_word_embeddings.bin")
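The reason the .bin file matters for OOV words is that fastText builds a vector for an unseen word from the vectors of its character n-grams, which are only stored in the .bin format. As a rough illustration of what those n-grams look like, here is a minimal sketch. The helper name `char_ngrams` and its default lengths are assumptions made for this example, and this is not the fastText library's own implementation:

```python
def char_ngrams(word, minn=3, maxn=6):
    """Return the character n-grams of `word`, wrapped in '<' and '>' boundary markers.

    Hypothetical helper for illustration only; minn/maxn are assumed defaults.
    """
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(minn, maxn + 1):
        # Slide a window of length n over the wrapped word.
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


print(char_ngrams("where", minn=3, maxn=3))
# ['<wh', 'whe', 'her', 'ere', 're>']
```

A word that never appeared in training still shares many of these n-grams with known words, which is why a model loaded from the .bin file can still return a vector for it.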
Upvotes: 1
Reputation: 55
It turned out that the FB Korean fastText model contains some weird Unicode sequences, and gensim will fix this problem in an update.
https://github.com/RaRe-Technologies/gensim/issues/2402
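To see how this kind of error arises in general, here is a small sketch that uses only the Python standard library. It does not reproduce the actual bytes in cc.ko.300.bin (for that, see the linked issue). It only shows, under the assumption that a multi-byte character was cut short, how a stray lead byte 0xed triggers the same message and how the `errors` argument of `bytes.decode` changes the outcome:

```python
raw = "한".encode("utf-8")        # b'\xed\x95\x9c', a three-byte UTF-8 sequence
truncated = raw[:1] + b"abc"      # lead byte 0xed followed by non-continuation bytes

message = ""
try:
    truncated.decode("utf-8")     # strict decoding is the default
except UnicodeDecodeError as err:
    message = str(err)
    print(message)

# Lenient decoding drops or replaces the malformed byte instead of raising.
lenient = truncated.decode("utf-8", errors="ignore")
replaced = truncated.decode("utf-8", errors="replace")
print(repr(lenient), repr(replaced))
```

Whether dropping or replacing malformed bytes is acceptable depends on how the affected vocabulary entries are used afterwards.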
Upvotes: 0
Reputation: 54243
Make sure you're using the latest (3.7.1) version of gensim; there have been recent fixes & improvements to load_fasttext_format().
Also, double-check your download of cc.ko.300.bin, to be sure it hasn't been corrupted or truncated.
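One way to check the download is to compare the file's size and hash against a second download or against any reference value you have. The following is a minimal sketch; the helper name `file_fingerprint` is made up for this example, and whether a published checksum exists for this file is an assumption you would need to verify yourself:

```python
import hashlib
import os


def file_fingerprint(path, chunk_size=1 << 20):
    """Return (size_in_bytes, sha256_hex) for the file at `path`.

    Hypothetical helper: reads the file in chunks so that large .bin files
    do not have to fit in memory.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return os.path.getsize(path), digest.hexdigest()


if __name__ == "__main__":
    path = "cc.ko.300.bin"  # file name taken from the question
    if os.path.exists(path):
        size, sha256 = file_fingerprint(path)
        print(f"{path}: {size} bytes, sha256={sha256}")
```

If two independent downloads give different sizes or hashes, at least one of them is damaged.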
If neither of these helps, enable logging at the INFO level, try the load again, and add the full output and error stack to your question to give more hints about where things are going wrong.
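A minimal sketch of enabling INFO-level logging with the standard `logging` module before retrying the load. The format string is just one common choice, and the commented-out lines repeat the call from the question:

```python
import logging

# Send log records at INFO level and above to the console.
logging.basicConfig(
    format="%(asctime)s : %(levelname)s : %(message)s",
    level=logging.INFO,
)
# Make sure the "gensim" logger itself is not filtered more strictly.
logging.getLogger("gensim").setLevel(logging.INFO)

# Then rerun the load from the question, for example:
# from gensim.models import FastText
# fasttext_model = FastText.load_fasttext_format('cc.ko.300.bin', encoding='utf8')
```

Run this before the import and load so that messages emitted during loading are captured.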
Upvotes: 0