Arsalan

Reputation: 373

How to load pre-trained glove model with gensim load_word2vec_format?

I am trying to load a pre-trained GloVe file as a word2vec model in gensim. I have downloaded the GloVe file from here. I am using the following script:

from gensim import models
model = models.KeyedVectors.load_word2vec_format('glove.6B.300d.txt', binary=True)

but I get the following error:

ValueError                                Traceback (most recent call last)
<ipython-input-38-e0b48b51f433> in <module>()
      1 from gensim import models
----> 2 model = models.KeyedVectors.load_word2vec_format('glove.6B.300d.txt', binary=True)

2 frames
/usr/local/lib/python3.6/dist-packages/gensim/models/utils_any2vec.py in <genexpr>(.0)
    171     with utils.smart_open(fname) as fin:
    172         header = utils.to_unicode(fin.readline(), encoding=encoding)
--> 173         vocab_size, vector_size = (int(x) for x in header.split())  # throws for invalid file format
    174         if limit:
    175             vocab_size = min(vocab_size, limit)

ValueError: invalid literal for int() with base 10: 'the'

What is the underlying problem? Does gensim need a specific format to be able to load it?

Upvotes: 4

Views: 8690

Answers (1)

gojomo

Reputation: 54193

The GloVe format is slightly different from the format that load_word2vec_format() supports: it lacks the first-line declaration of vector count and dimensions.

Gensim includes a glove2word2vec utility script that you can run once to convert the file:

https://radimrehurek.com/gensim/scripts/glove2word2vec.html
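To make the format difference concrete, here is a minimal plain-Python sketch of what such a conversion amounts to: count the vectors, read the dimensionality from the first line, and write a copy with a "<count> <dims>" header prepended. This is not gensim's actual implementation; the function name add_word2vec_header is hypothetical, and the sketch assumes a space-separated text file with one vector per line and no blank lines.

```python
def add_word2vec_header(glove_path, output_path, encoding="utf-8"):
    """Copy a GloVe text file, prepending a '<count> <dims>' header line.

    Hypothetical helper for illustration only; assumes one
    space-separated vector per line and no blank lines.
    """
    # First pass: count the vectors and take the dimensionality
    # from the first line (number of fields minus the word itself).
    count = 0
    dims = 0
    with open(glove_path, encoding=encoding) as src:
        for line in src:
            if count == 0:
                dims = len(line.rstrip().split(" ")) - 1
            count += 1

    # Second pass: write the header, then copy every vector line unchanged.
    with open(glove_path, encoding=encoding) as src, \
            open(output_path, "w", encoding=encoding) as dst:
        dst.write(f"{count} {dims}\n")
        for line in src:
            dst.write(line)

    return count, dims
```

The converted file should then load with load_word2vec_format(..., binary=False). In practice the bundled script is the safer choice; according to its documentation it can be run as python -m gensim.scripts.glove2word2vec --input <GloVe file> --output <word2vec file>.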

Also, starting in Gensim 4.0.0 (currently in prerelease testing), the load_word2vec_format() method gets a new optional no_header parameter:

https://radimrehurek.com/gensim/models/keyedvectors.html?highlight=load_word2vec_format#gensim.models.keyedvectors.KeyedVectors.load_word2vec_format

If set as no_header=True, the method will deduce the count/dimensions from a preliminary scan of the file, so it can read a GloVe file with that option, but at the cost of two full-file reads instead of one. (So you may still want to re-save the object with .save_word2vec_format(), or use the glove2word2vec script, to make future loads faster.)

Upvotes: 7
