Reputation: 11
i = 0
list_of_sent = []
for sent in df["Heading"]:
    filtered_sentence = []
    for w in sent.split():
        if len(w) == 0:
            continue
        print(w)
        for cleaned_words in clean_punc(w).split():
            if cleaned_words.isalpha():
                filtered_sentence.append(cleaned_words.lower())
    list_of_sent.append(filtered_sentence)
I want to apply a Word2Vec model, for which I am first converting my data column's values into a list of sentences. clean_punc is the following function:
import re

def clean_punc(sentence):
    cleaned = re.sub(r'[?|!| \'|"|#]', r'', sentence)
    cleaned = re.sub(r'[.|,)|(|\|/]', r' ', cleaned)
    return cleaned
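For example, on a sample string of my own (not from the dataset), the function behaves like this:

```python
import re

def clean_punc(sentence):
    # First class strips ? ! ' " # (and, literally, | and the space character);
    # second class replaces . , ( ) \ / with spaces.
    cleaned = re.sub(r'[?|!| \'|"|#]', r'', sentence)
    cleaned = re.sub(r'[.|,)|(|\|/]', r' ', cleaned)
    return cleaned

print(clean_punc("Hello, world! (test)").split())  # ['Hello', 'world', 'test']
```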
and I am applying the Word2Vec model:
w2v_model = gensim.models.Word2Vec(list_of_sent, min_count=1, vector_size=50, workers=4)
When I run the following code:
words=list(w2v_model.wv)
print(len(words))
I get this error:
KeyError Traceback (most recent call last)
/tmp/ipykernel_38/1883829707.py in <module>
----> 1 words=list(w2v_model.wv)
2 print(len(words))
/opt/conda/lib/python3.7/site-packages/gensim/models/keyedvectors.py in __getitem__(self, key_or_keys)
377 """
378 if isinstance(key_or_keys, KEY_TYPES):
--> 379 return self.get_vector(key_or_keys)
380
381 return vstack([self.get_vector(key) for key in key_or_keys])
/opt/conda/lib/python3.7/site-packages/gensim/models/keyedvectors.py in get_vector(self, key, norm)
420
421 """
--> 422 index = self.get_index(key)
423 if norm:
424 self.fill_norms()
/opt/conda/lib/python3.7/site-packages/gensim/models/keyedvectors.py in get_index(self, key, default)
394 return default
395 else:
--> 396 raise KeyError(f"Key '{key}' not present")
397
398 def get_vector(self, key, norm=False):
KeyError: "Key '141101' not present"
Please help me resolve this error.
Upvotes: 0
Views: 856
Reputation: 54243
I'm not sure what's happening when you try to do list(w2v_model.wv), but that isn't a typical or necessary operation, and your traceback doesn't show how the list() invocation turns into a __getitem__ call against the vectors.
So, I'd suggest you achieve whatever your actual goal is without this unusual attempt to cast a KeyedVectors as a list.
For example, if you just want the number of words learned by the model, you can just do:
print(len(w2v_model.wv))
If you want a list of the words themselves, the words are the keys with which you can look up vectors, so the list of words is in:
w2v_model.wv.index_to_key
You can get a dict which maps each word to its relative index position, inside the backing array, via:
w2v_model.wv.key_to_index
(Checking the len() of this dict is another way to count the words, and asking for its .keys() will show all the words.)
Separately: min_count=1 is almost always a bad idea with this algorithm. Rare words don't have enough varied examples to train good vectors for themselves, but collectively they appear often enough in typical natural-language text to serve as 'noise' that weakens the other words' vectors. It's almost always better to discard words that appear only a few times, as the default min_count=5 does, so the remaining words train faster and better.
Upvotes: 1