user26623260

Reputation: 1

How to load word2vec model from zip file not having .bin file inside?

I'm trying out the webvectors project. This code works fine:

import zipfile

import gensim

nlpl_zip = "C:/180.zip"
with zipfile.ZipFile(nlpl_zip, "r") as archive:
    # model.bin is stored in the word2vec binary format
    stream = archive.open("model.bin")
    model = gensim.models.KeyedVectors.load_word2vec_format(
        stream, binary=True, unicode_errors="replace"
    )

But when I downloaded the model from http://vectors.nlpl.eu/repository/20/212.zip to C:/212.zip, the same code fails because there is no model.bin inside. The archive contains only other files, such as model.ckpt.data-00000-of-00001.

But when I try

stream = archive.open("model.ckpt.data-00000-of-00001")

I get the following error. What am I doing wrong?

UnicodeDecodeError Traceback (most recent call last)
Cell In[11], line 9
7 with zipfile.ZipFile(model_file, 'r') as archive:
8 stream = archive.open('model.ckpt.data-00000-of-00001')
9 model = gensim.models.KeyedVectors.load_word2vec_format(stream, binary=True,unicode_errors='replace')

File C:\ProgramData\anaconda3\lib\site-packages\gensim\models\keyedvectors.py:1719, in KeyedVectors.load_word2vec_format(cls, fname, fvocab, binary, encoding, unicode_errors, limit, datatype, no_header)
1672 @classmethod
1673 def load_word2vec_format(
1674 cls, fname, fvocab=None, binary=False, encoding='utf8', unicode_errors='strict',
1675 limit=None, datatype=REAL, no_header=False,
1676 ):
1677 """Load KeyedVectors from a file produced by the original C word2vec-tool format.
1678
1679 Warnings
    (...)
1717
1718 """
1719 return _load_word2vec_format(
1720 cls, fname, fvocab=fvocab, binary=binary, encoding=encoding, unicode_errors=unicode_errors,
1721 limit=limit, datatype=datatype, no_header=no_header,
1722 )

File C:\ProgramData\anaconda3\lib\site-packages\gensim\models\keyedvectors.py:2058, in _load_word2vec_format(cls, fname, fvocab, binary, encoding, unicode_errors, limit, datatype, no_header, binary_chunk_size)
2056 fin = utils.open(fname, 'rb')
2057 else:
2058 header = utils.to_unicode(fin.readline(), encoding=encoding)
2059 vocab_size, vector_size = [int(x) for x in header.split()] # throws for invalid file format
2060 if limit:

File C:\ProgramData\anaconda3\lib\site-packages\gensim\utils.py:365, in any2unicode(text, encoding, errors)
363 if isinstance(text, str):
364 return text
365 return str(text, encoding, errors=errors)

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xef in position 1: invalid continuation byte

I have tried many approaches, but none of them worked.

Upvotes: 0

Views: 65

Answers (1)

gojomo

Reputation: 54153

As far as I can tell, none of the 'ELMo' downloads at the site you've identified include word-vectors of the formats readable by Gensim.

So, you'd need to check the docs about those files, from the creators of those files or the tools they say they used, to identify what parts of the downloads would be word vectors, and in what format.
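As a rough first check before reading those docs, you could list what is in the archive and test whether any member even starts with the header that load_word2vec_format expects. The traceback above shows that this header is a first line with two integers, vocab_size and vector_size. The sketch below is only a heuristic under that assumption: the function names looks_like_word2vec_header and scan_archive are made up for illustration and are not part of Gensim, and a matching header does not prove that the rest of the file is a loadable model.

```python
import zipfile


def looks_like_word2vec_header(first_line: bytes) -> bool:
    """Return True if the line parses as '<vocab_size> <vector_size>'."""
    try:
        fields = [int(x) for x in first_line.decode("utf-8").split()]
    except (UnicodeDecodeError, ValueError):
        # Undecodable bytes or non-numeric tokens: not a word2vec header.
        return False
    return len(fields) == 2 and all(n > 0 for n in fields)


def scan_archive(zip_source) -> dict:
    """Map each member name to whether its first line resembles a word2vec header.

    zip_source may be a path or a binary file-like object.
    """
    results = {}
    with zipfile.ZipFile(zip_source, "r") as archive:
        for name in archive.namelist():
            with archive.open(name) as stream:
                # Cap the read so a binary blob without newlines is not read whole.
                first_line = stream.readline(100)
            results[name] = looks_like_word2vec_header(first_line)
    return results
```

Calling scan_archive("C:/212.zip") would then return a dict of member names mapped to True or False. If every value is False, as the error above suggests for model.ckpt.data-00000-of-00001, no file in the archive is a candidate for load_word2vec_format.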

Upvotes: 0
