Reputation: 373
I am trying to load pre-trained GloVe vectors as a word2vec model in gensim. I downloaded the GloVe file from here. I am using the following script:
from gensim import models
model = models.KeyedVectors.load_word2vec_format('glove.6B.300d.txt', binary=True)
but I get the following error:
ValueError Traceback (most recent call last)
<ipython-input-38-e0b48b51f433> in <module>()
1 from gensim import models
----> 2 model = models.KeyedVectors.load_word2vec_format('glove.6B.300d.txt', binary=True)
2 frames
/usr/local/lib/python3.6/dist-packages/gensim/models/utils_any2vec.py in <genexpr>(.0)
171 with utils.smart_open(fname) as fin:
172 header = utils.to_unicode(fin.readline(), encoding=encoding)
--> 173 vocab_size, vector_size = (int(x) for x in header.split()) # throws for invalid file format
174 if limit:
175 vocab_size = min(vocab_size, limit)
ValueError: invalid literal for int() with base 10: 'the'
What is the underlying problem? Does gensim need a specific format to be able to load it?
Upvotes: 4
Views: 8690
Reputation: 54193
The GloVe format differs slightly from the format that load_word2vec_format() supports: it is missing the first-line declaration of vector count and dimensions.
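To make the difference concrete, here is a minimal pure-Python sketch of what adding that header involves. The function name add_word2vec_header is hypothetical (it is not part of gensim), and the sketch assumes a well-formed, space-separated GloVe text file with one vector per line and no blank lines.

```python
def add_word2vec_header(glove_path, out_path):
    """Copy a GloVe text file to out_path, prefixed with a '<count> <dims>' line.

    Hypothetical helper for illustration only; not a gensim function.
    """
    count = 0
    dims = 0
    # First pass: count the vectors and read the dimensionality from the first row.
    with open(glove_path, encoding="utf-8") as fin:
        for line in fin:
            if count == 0:
                # Each row is "<word> <v1> <v2> ...", so drop one token for the word.
                dims = len(line.rstrip().split(" ")) - 1
            count += 1
    # Second pass: write the header line, then the vector rows unchanged.
    with open(glove_path, encoding="utf-8") as fin, \
            open(out_path, "w", encoding="utf-8") as fout:
        fout.write(f"{count} {dims}\n")
        for line in fin:
            fout.write(line)
    return count, dims
```

For the file in the question, a call such as add_word2vec_header('glove.6B.300d.txt', 'glove.6B.300d.w2v.txt') would write a copy whose first line holds the vector count and 300, which is the line load_word2vec_format() tried to parse when it hit 'the'.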
Gensim includes a glove2word2vec utility script that you can run once to convert the file:
https://radimrehurek.com/gensim/scripts/glove2word2vec.html
Also, starting in Gensim 4.0.0 (currently in prerelease testing), the load_word2vec_format() method gets a new optional no_header parameter.
If set as no_header=True, the method will deduce the count and dimensions from a preliminary scan of the file, so it can read a GloVe file with that option, but at the cost of two full-file reads instead of one. (So you may still want to re-save the object with .save_word2vec_format(), or use the glove2word2vec script, to make future loads faster.)
Upvotes: 7