andy lacron

Reputation: 71

I get "AttributeError: The vocab attribute was removed from KeyedVector in Gensim 4.0.0" when I try to load the Google News vector embeddings

I have a class, Featurizer, that checks for the existence of an embedding_file containing the Google News word2vec embeddings and loads it when instantiated. However, when I use Featurizer to load the model, it raises this error:

    AttributeError                            Traceback (most recent call last)
d:\mt 111\QuestionAnswer\training_model.ipynb Cell 11' in <cell line: 2>()
      1 emb_file = os.path.join('D:\mt 111\QuestionAnswer\embedding_file', 'GoogleNews-vectors-negative300.bin')
----> 2 featurizer = Featurizer(emb_file)

d:\mt 111\QuestionAnswer\training_model.ipynb Cell 4' in Featurizer.__init__(self, embedding_file)
     11 print('INFO: Loading word vectors...')
     12 self.word2vec = KeyedVectors.load_word2vec_format(
     13     'GoogleNews-vectors-negative300.bin',
     14     binary=True)
     16 print('INFO: Done! Using %s word vectors from pre-trained word2vec.' \
---> 17     %len(self.word2vec.vocab))

File d:\mt 111\QuestionAnswer\venv\lib\site-packages\gensim\models\keyedvectors.py:735, in KeyedVectors.vocab(self)
    733 @property
    734 def vocab(self):
--> 735     raise AttributeError(
    736         "The vocab attribute was removed from KeyedVector in Gensim 4.0.0.\n"
    737         "Use KeyedVector's .key_to_index dict, .index_to_key list, and methods "
    738         ".get_vecattr(key, attr) and .set_vecattr(key, attr, new_val) instead.\n"
    739         "See https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4"
    740     )

AttributeError: The vocab attribute was removed from KeyedVector in Gensim 4.0.0.
Use KeyedVector's .key_to_index dict, .index_to_key list, and methods .get_vecattr(key, attr) and .set_vecattr(key, attr, new_val) instead.

See https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4

This is the class:

class Featurizer:

    def __init__(self, embedding_file):

        if not os.path.exists(embedding_file):
            raise IOError("Embeddings file does not exist: %s" %embedding_file)

        punctuation = string.punctuation
        punctuation = punctuation + "’" + "“" + "?" + "‘"
        self.punctuation = punctuation
        print('INFO: Loading word vectors...')
        self.word2vec = KeyedVectors.load_word2vec_format(
            embedding_file,
            binary=True)

        print('INFO: Done! Using %s word vectors from pre-trained word2vec.' \
            %len(self.word2vec.vocab))

This is how I try to load the embeddings with the Featurizer class:

emb_file = os.path.join('D:\mt 111\QuestionAnswer\embedding_file', 'GoogleNews-vectors-negative300.bin')
featurizer = Featurizer(emb_file)

Ideally, if it loaded properly, it would print output from the Featurizer class such as:

emb_file = os.path.join('D:\mt 111\QuestionAnswer\embedding_file', 'GoogleNews-vectors-negative300.bin')
featurizer = Featurizer(emb_file)

INFO: Loading word vectors...

INFO: Done! Using 3000000 word vectors from pre-trained word2vec.

How can I fix this?

Upvotes: 1

Views: 1371

Answers (1)

gojomo

Reputation: 54183

The load succeeded; the failure was in your line of code that tried to report len(self.word2vec.vocab).

Let me quote the error message for the reason that your code couldn't access a .vocab property:

The vocab attribute was removed from KeyedVector in Gensim 4.0.0. Use KeyedVector's .key_to_index dict, .index_to_key list, and methods .get_vecattr(key, attr) and .set_vecattr(key, attr, new_val) instead. See https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4

So, you can't use .vocab anymore, but there are several new properties listed there, like .key_to_index (a dict like vocab was) or .index_to_key (a list of all lookup keys – words – in the set-of-vectors).

Have you tried using any of those specific properties recommended in the error message you received, instead of .vocab?

Or have you visited the recommended URL, which gives specific before-and-after code examples showing how to replace references to the no-longer-available .vocab attribute? Here are the relevant lines showing what not to do (🚫) and what to do instead (👍) for your case:

vocab_len = len(model.wv.vocab)  # 🚫
…
vocab_len = len(model.wv)  # 👍
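
As a rough sketch of how that applies to your Featurizer: self.word2vec is already a KeyedVectors object, so you don't need to go through .wv. On Gensim 4, len(self.word2vec) or len(self.word2vec.key_to_index) should give the count. If the code may also run on Gensim 3.x, a small helper can try the new attribute first. The names vocab_size and FakeKeyedVectors below are hypothetical, and FakeKeyedVectors is only a stand-in so the example runs without the real embedding file:

```python
def vocab_size(kv):
    """Return the number of keys in a KeyedVectors-like object.

    Prefers the Gensim 4 `key_to_index` dict and falls back to the
    Gensim 3 `vocab` dict. `vocab_size` is a hypothetical helper name.
    """
    key_to_index = getattr(kv, "key_to_index", None)
    if key_to_index is not None:
        return len(key_to_index)
    return len(kv.vocab)


class FakeKeyedVectors:
    """Stand-in (assumption) that mimics the Gensim 4 attributes named in the error message."""

    def __init__(self, words):
        self.index_to_key = list(words)  # list of lookup keys
        self.key_to_index = {w: i for i, w in enumerate(self.index_to_key)}  # key -> position

    def __len__(self):
        return len(self.index_to_key)


word2vec = FakeKeyedVectors(["king", "queen", "apple"])
print('INFO: Done! Using %s word vectors from pre-trained word2vec.'
      % vocab_size(word2vec))
```

In your class, that would mean replacing len(self.word2vec.vocab) in the final print with len(self.word2vec), or with vocab_size(self.word2vec) if you want the fallback.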

Upvotes: 2
