zzaebok

Reputation: 55

Facebook fasttext bin model UnicodeDecodeError

I downloaded a pretrained word vector file (.bin) from Facebook (https://fasttext.cc/docs/en/crawl-vectors.html). However, when I try to load this model, it raises an error.

from gensim.models import FastText
fasttext_model = FastText.load_fasttext_format('cc.ko.300.bin', encoding='utf8')

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte

The strange thing is that loading works fine when I use the older version of the .bin file (https://fasttext.cc/docs/en/pretrained-vectors.html).

What is wrong with these files, and how can I fix it?

I must use the .bin file because I need all the n-grams to handle OOV words, so solutions like 'use the .vec file' won't help.

Thank you so much :)

Upvotes: 0

Views: 1996

Answers (3)

PinkBanter

Reputation: 1976

It is better to load the fastText word embeddings using the fastText package rather than gensim.

You first need to install the fasttext module for Python using pip install fasttext

Then load the model with the following Python code:

import fasttext
model = fasttext.load_model("path/2/pretrained_fastText_word_embeddings.bin")
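
Since the question is about out-of-vocabulary words, here is a minimal sketch of looking up a vector once the model is loaded. get_word_vector and get_subwords are methods of the model object returned by fasttext.load_model; the helper name describe_word and the example word in the comments are hypothetical, and the sketch assumes a model trained with subword information.

```python
def describe_word(model, word):
    """Return (vector, subwords) for `word`, which may be out of vocabulary."""
    # For models trained with subword information, get_word_vector builds the
    # vector from character n-grams, so it also works for unseen words.
    vector = model.get_word_vector(word)
    # get_subwords returns (subword strings, their indices); keep the strings.
    subwords, _indices = model.get_subwords(word)
    return vector, list(subwords)


# Usage, left commented out because it needs the fasttext package and the file:
# import fasttext
# model = fasttext.load_model("path/2/pretrained_fastText_word_embeddings.bin")
# vector, subwords = describe_word(model, "some_unseen_word")
```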


Upvotes: 1

zzaebok

Reputation: 55

It turned out that the Facebook Korean fastText model contains some malformed Unicode, and gensim will be updated to handle this problem.

https://github.com/RaRe-Technologies/gensim/issues/2402

Upvotes: 0

gojomo

Reputation: 54243

Make sure you're using the latest (3.7.1) version of gensim; there have been recent fixes & improvements to load_fasttext_format().

Also, double-check your download of cc.ko.300.bin, to be sure it hasn't been corrupted or truncated.
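
One way to run that check, as a minimal sketch: compute the file's size and MD5 digest, then compare them against a second, independent download (or a published checksum, if one is available for the file). The helper name file_fingerprint and the chunk size are illustrative choices, not part of gensim or fastText.

```python
import hashlib
import os


def file_fingerprint(path, chunk_size=1 << 20):
    """Return (size_in_bytes, md5_hex_digest) for the file at `path`."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so a multi-gigabyte .bin file never has to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return os.path.getsize(path), digest.hexdigest()
```

Calling file_fingerprint('cc.ko.300.bin') on two separate downloads should give identical results; a smaller size on one of them suggests a truncated file.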

If neither of these helps, enable logging at the INFO level, try the load again, and add the full output and error stack to your question to give more hints about where things are going wrong.
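
A minimal sketch of turning on INFO-level logging with Python's standard logging module before retrying the load. The format string is just one common choice; the commented-out load call is the one from the question.

```python
import logging

# Send INFO-and-above messages (including gensim's progress messages) to stderr.
logging.basicConfig(
    format="%(asctime)s : %(levelname)s : %(message)s",
    level=logging.INFO,
)
# basicConfig does nothing if the root logger already has handlers,
# so set the level explicitly as well.
logging.getLogger().setLevel(logging.INFO)

# Then retry the load; left commented out because it needs gensim and the file:
# from gensim.models import FastText
# fasttext_model = FastText.load_fasttext_format('cc.ko.300.bin', encoding='utf8')
```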

Upvotes: 0
