Rishabh Talwar

Reputation: 11

KeyError while implementing the Word2Vec model in natural language processing

list_of_sent = []
for sent in df["Heading"]:
    filtered_sentence = []
    for w in sent.split():
        # clean_punc may replace punctuation with spaces, so split again
        for cleaned_word in clean_punc(w).split():
            # keep purely alphabetic tokens, lowercased
            if cleaned_word.isalpha():
                filtered_sentence.append(cleaned_word.lower())
    list_of_sent.append(filtered_sentence)

I want to apply a Word2Vec model, so I first convert the values of my data column into a list of tokenized sentences. clean_punc is the following function:

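import re

def clean_punc(sentence):
    # Inside [...] the "|" is a literal character, not alternation,
    # so this also strips "|" and spaces.
    cleaned = re.sub(r'[?!\'"#| ]', '', sentence)
    # Replace separators with a space so that joined words get split apart.
    cleaned = re.sub(r'[.,()|/]', ' ', cleaned)
    return cleaned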

Then I train the Word2Vec model:

import gensim
w2v_model = gensim.models.Word2Vec(list_of_sent, min_count=1, vector_size=50, workers=4)

When I run the following code:

words=list(w2v_model.wv)
print(len(words))

I get this error:

KeyError                                  Traceback (most recent call last)
/tmp/ipykernel_38/1883829707.py in <module>
----> 1 words=list(w2v_model.wv)
      2 print(len(words))

/opt/conda/lib/python3.7/site-packages/gensim/models/keyedvectors.py in __getitem__(self, key_or_keys)
    377         """
    378         if isinstance(key_or_keys, KEY_TYPES):
--> 379             return self.get_vector(key_or_keys)
    380 
    381         return vstack([self.get_vector(key) for key in key_or_keys])

/opt/conda/lib/python3.7/site-packages/gensim/models/keyedvectors.py in get_vector(self, key, norm)
    420 
    421         """
--> 422         index = self.get_index(key)
    423         if norm:
    424             self.fill_norms()

/opt/conda/lib/python3.7/site-packages/gensim/models/keyedvectors.py in get_index(self, key, default)
    394             return default
    395         else:
--> 396             raise KeyError(f"Key '{key}' not present")
    397 
    398     def get_vector(self, key, norm=False):

KeyError: "Key '141101' not present"

Please help me resolve this error.

Upvotes: 0

Views: 856

Answers (1)

gojomo

Reputation: 54243

I'm not sure what's happening when you call list(w2v_model.wv), but that isn't a typical or necessary operation, and your traceback doesn't show how the list() call turns into a __getitem__ call against the vectors.

So, I'd suggest you achieve whatever your actual goal is without this unusual attempt to cast a KeyedVectors as a list.
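One plausible explanation, offered as an assumption rather than something read off the traceback: when an object defines __getitem__ but no __iter__, Python's list() falls back to calling obj[0], obj[1], obj[2], ... and stops only on an IndexError. If the lookup raises a KeyError for the first out-of-range integer instead, that error escapes from list(). The sketch below reproduces the pattern with a toy class; TinyVectors is a hypothetical stand-in, not gensim's KeyedVectors, and it only assumes that integer positions inside the vocabulary are accepted as keys.

```python
class TinyVectors:
    """Toy stand-in for a keyed-vector store; NOT gensim's actual class."""

    def __init__(self, words):
        self.index_to_key = list(words)
        self.key_to_index = {w: i for i, w in enumerate(self.index_to_key)}

    def __len__(self):
        return len(self.index_to_key)

    def __getitem__(self, key):
        # Accept a known word, or an integer position inside the vocabulary.
        if key in self.key_to_index:
            return f"<vector for {key}>"
        if isinstance(key, int) and 0 <= key < len(self.index_to_key):
            return f"<vector for {self.index_to_key[key]}>"
        # A KeyError, unlike an IndexError, does not end the fallback iteration.
        raise KeyError(f"Key '{key}' not present")


kv = TinyVectors(["apple", "banana", "cherry"])

# With no __iter__, list() calls kv[0], kv[1], kv[2], kv[3], ...
try:
    list(kv)
    error_message = None
except KeyError as exc:
    error_message = exc.args[0]

print(error_message)    # the first integer past the vocabulary is the bad key
print(len(kv))          # counting entries works without iterating
print(kv.index_to_key)  # the words themselves, in index order
```

If the same mechanism applies to your model, the number in the error message would simply be the vocabulary size, which is also what len(w2v_model.wv) should report.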

For example, if you just want the number of words learned by the model, you can just do:

print(len(w2v_model.wv))

If you want a list of the words themselves, the words are the keys with which you can look up vectors, so the list of words is in:

w2v_model.wv.index_to_key

You can get a dict which maps each word to its relative index position, inside the backing array, via:

w2v_model.wv.key_to_index

(Checking the len() of this dict is another way to count the words, or asking for its .keys() will show all the words.)
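To make the relationship between the two attributes concrete, here is a minimal sketch. It uses a hypothetical stand-in object built with SimpleNamespace rather than a trained model, so the three words and their positions are made up; only the attribute names index_to_key and key_to_index come from the text above.

```python
from types import SimpleNamespace

# Hypothetical stand-in carrying the two attributes; not a real trained model.
kv = SimpleNamespace(
    index_to_key=["the", "market", "rally"],
    key_to_index={"the": 0, "market": 1, "rally": 2},
)

words = list(kv.index_to_key)       # all words, in index order
word_count = len(kv.key_to_index)   # another way to count the words

# The two structures are inverses: each word's index leads back to the word.
round_trip_ok = all(
    kv.index_to_key[index] == word for word, index in kv.key_to_index.items()
)

print(words)
print(word_count)
print(round_trip_ok)
```

With a real model, replacing kv with w2v_model.wv should give the same kind of result.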

Separately: min_count=1 is almost always a bad idea with this algorithm. Rare words don't have enough varied examples for the model to learn a good vector for them, yet in typical natural-language text they are collectively numerous enough to act as 'noise' that weakens the vectors of other words. It's almost always better to discard words that appear only a few times, as the default min_count=5 does, so that the remaining words train faster and better.
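To see which words a given threshold would discard before training, a rough sketch with collections.Counter may help. This only illustrates the frequency cutoff, not gensim's internal vocabulary building; split_by_min_count and the sample sentences are hypothetical.

```python
from collections import Counter


def split_by_min_count(sentences, min_count=5):
    """Split the vocabulary into words kept and dropped by a frequency cutoff."""
    counts = Counter(word for sentence in sentences for word in sentence)
    # Words seen fewer than min_count times are the ones that would be ignored.
    kept = {w: c for w, c in counts.items() if c >= min_count}
    dropped = {w: c for w, c in counts.items() if c < min_count}
    return kept, dropped


# Made-up tokenized corpus: "tumble" appears only once.
sentences = [["stocks", "rise"]] * 5 + [["stocks", "tumble"]]

kept, dropped = split_by_min_count(sentences, min_count=5)
print(sorted(kept))
print(sorted(dropped))
```

Running the same function on list_of_sent would show how much of the vocabulary consists of words that occur only a handful of times.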

Upvotes: 1
