Reputation: 1
I'm trying out the webvectors project. This code works fine:
import zipfile
import gensim

nlpl_zip = "C:/180.zip"
with zipfile.ZipFile(nlpl_zip, "r") as archive:
    # model.bin holds the vectors in binary word2vec format
    stream = archive.open("model.bin")
    model = gensim.models.KeyedVectors.load_word2vec_format(
        stream, binary=True, unicode_errors="replace"
    )
But when I downloaded the model from http://vectors.nlpl.eu/repository/20/212.zip to C:/212.zip, the same code fails, because the archive contains no model.bin, only these files:
When I instead try
stream = archive.open("model.ckpt.data-00000-of-00001")
I get the following error. What am I doing wrong?
UnicodeDecodeError Traceback (most recent call last)
Cell In[11], line 9
7 with zipfile.ZipFile(model_file, 'r') as archive:
8 stream = archive.open('model.ckpt.data-00000-of-00001')
9 model = gensim.models.KeyedVectors.load_word2vec_format(stream, binary=True,unicode_errors='replace')
File C:\ProgramData\anaconda3\lib\site-packages\gensim\models\keyedvectors.py:1719, in KeyedVectors.load_word2vec_format(cls, fname, fvocab, binary, encoding, unicode_errors, limit, datatype, no_header)
1672 @classmethod
1673 def load_word2vec_format(
1674 cls, fname, fvocab=None, binary=False, encoding='utf8', unicode_errors='strict',
1675 limit=None, datatype=REAL, no_header=False,
1676 ):
1677 """Load KeyedVectors from a file produced by the original C word2vec-tool format.
1678
1679 Warnings
(...)
1717
1718 """
1719 return _load_word2vec_format(
1720 cls, fname, fvocab=fvocab, binary=binary, encoding=encoding, unicode_errors=unicode_errors,
1721 limit=limit, datatype=datatype, no_header=no_header,
1722 )
File C:\ProgramData\anaconda3\lib\site-packages\gensim\models\keyedvectors.py:2058, in _load_word2vec_format(cls, fname, fvocab, binary, encoding, unicode_errors, limit, datatype, no_header, binary_chunk_size)
2056 fin = utils.open(fname, 'rb')
2057 else:
2058 header = utils.to_unicode(fin.readline(), encoding=encoding)
2059 vocab_size, vector_size = [int(x) for x in header.split()] # throws for invalid file format
2060 if limit:
File C:\ProgramData\anaconda3\lib\site-packages\gensim\utils.py:365, in any2unicode(text, encoding, errors)
363 if isinstance(text, str):
364 return text
365 return str(text, encoding, errors=errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xef in position 1: invalid continuation byte
I've tried many approaches, but all of them failed.
Upvotes: 0
Views: 65
Reputation: 54153
As far as I can tell, none of the 'ELMo' downloads at the site you've identified include word-vectors of the formats readable by Gensim.
So you'd need to check the documentation for those files, from their creators or from the tools they say they used, to identify which parts of the downloads are word vectors, and in what format.
Upvotes: 0
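As a rough way to check an archive before handing a member to Gensim, the sketch below lists the files in a zip and flags those whose first line looks like a word2vec header, i.e. two positive integers (vocabulary size and vector size), which is what the traceback shows `_load_word2vec_format` trying to parse. The helper names `looks_like_word2vec` and `find_word2vec_members` are hypothetical, not part of Gensim, and the check is only a heuristic: a match does not guarantee the file will load, though a non-match means it will fail at the header step.

```python
import zipfile


def looks_like_word2vec(first_line: bytes) -> bool:
    """Return True if the line parses as '<vocab_size> <vector_size>'."""
    try:
        fields = first_line.decode("utf-8").split()
        return len(fields) == 2 and all(int(x) > 0 for x in fields)
    except (UnicodeDecodeError, ValueError):
        # Raw binary data (such as a TensorFlow checkpoint) ends up here.
        return False


def find_word2vec_members(zip_source):
    """List archive members whose first line looks like a word2vec header."""
    candidates = []
    with zipfile.ZipFile(zip_source, "r") as archive:
        for name in archive.namelist():
            if name.endswith("/"):  # skip directory entries
                continue
            with archive.open(name) as stream:
                # Read at most 64 bytes so a binary file without
                # newlines is not pulled into memory.
                if looks_like_word2vec(stream.readline(64)):
                    candidates.append(name)
    return candidates


# Example call, using the path from the question:
# print(find_word2vec_members("C:/212.zip"))
```

If this returns an empty list, no file in the archive starts with a word2vec header, so `load_word2vec_format` is not the right loader for that download.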