Gensim: raise KeyError("word '%s' not in vocabulary" % word)

Question

I have the code below, and my dataset is a list of articles. Each row contains one article.

I run this code:

import gensim    
docgen = TokenGenerator( raw_documents, custom_stop_words )    
# the model has 500 dimensions, the minimum document-term frequency is 20    
w2v_model = gensim.models.Word2Vec(docgen, size=500, min_count=20, sg=1)    
print( "Model has %d terms" % len(w2v_model.wv.vocab) )    
w2v_model.save("w2v-model.bin")    
# To re-load this model, run    
#w2v_model = gensim.models.Word2Vec.load("w2v-model.bin")    
def calculate_coherence( w2v_model, term_rankings ):
    overall_coherence = 0.0
    for topic_index in range(len(term_rankings)):
        # check each pair of terms
        pair_scores = []
        for pair in combinations(term_rankings[topic_index], 2 ):
            pair_scores.append( w2v_model.similarity(pair[0], pair[1]) )
        # get the mean for all pairs in this topic
        topic_score = sum(pair_scores) / len(pair_scores)
        overall_coherence += topic_score
    # get the mean score across all topics
    return overall_coherence / len(term_rankings)

import numpy as np    
def get_descriptor( all_terms, H, topic_index, top ):    
    # reverse sort the values to sort the indices    
    top_indices = np.argsort( H[topic_index,:] )[::-1]    
    # now get the terms corresponding to the top-ranked indices    
    top_terms = []    
    for term_index in top_indices[0:top]:    
        top_terms.append( all_terms[term_index] )    
    return top_terms    
from itertools import combinations    
k_values = []    
coherences = []    
for (k,W,H) in topic_models:    
    # Get all of the topic descriptors - the term_rankings, based on top 10 terms
    term_rankings = []    
    for topic_index in range(k):
        term_rankings.append( get_descriptor( terms, H, topic_index, 10 ) )

    # Now calculate the coherence based on our Word2vec model
    k_values.append( k )
    coherences.append( calculate_coherence( w2v_model, term_rankings ) )
    print("K=%02d: Coherence=%.4f" % ( k, coherences[-1] ) )

I get this error:

raise KeyError("word '%s' not in vocabulary" % word)

KeyError: u"word 'business' not in vocabulary"

The original code works great with their data set.

https://github.com/derekgreene/topic-model-tutorial

Could you help me understand what this error means?

gojomo · Accepted Answer

It could help answerers if you included more of the information around the error message, such as the multiple-lines of call-frames that will clearly indicate which line of your code triggered the error.

However, if you receive the error KeyError: u"word 'business' not in vocabulary", you can trust that your Word2Vec instance, w2v_model, never learned the word 'business'.
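One way to confirm this before the coherence loop runs is to list the descriptor terms the model does not know. This is a minimal sketch, assuming the vocabulary is any container that supports `in` (for the Gensim version in the question that would presumably be `w2v_model.wv.vocab`); the helper name `missing_terms` and the sample data are hypothetical:

```python
def missing_terms(vocabulary, term_rankings):
    """Return the sorted terms in term_rankings that are absent from vocabulary."""
    missing = set()
    for ranking in term_rankings:
        for term in ranking:
            if term not in vocabulary:
                missing.add(term)
    return sorted(missing)


# Stand-in vocabulary for illustration only; with the question's model one would
# pass w2v_model.wv.vocab instead (an assumption about the installed Gensim version).
example_vocab = {"market": 0, "bank": 1, "growth": 2}
example_rankings = [["market", "business", "bank"], ["growth", "business", "trade"]]
print(missing_terms(example_vocab, example_rankings))
```

If the returned list is long, the mismatch is more likely in how the model was trained than in any single word.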

This might be because it didn't appear in the training data the model was presented, or perhaps appeared but fewer than min_count times.
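The effect of min_count can be illustrated with plain counting. The sketch below is not Gensim's implementation; it only mimics the documented behaviour that words whose total frequency across the corpus is below min_count are ignored. The function name `surviving_vocabulary` and the toy corpus are made up for illustration:

```python
from collections import Counter


def surviving_vocabulary(token_lists, min_count):
    """Return the words that occur at least min_count times across all token lists."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    return {word for word, n in counts.items() if n >= min_count}


# Toy corpus: 'business' appears once, so it is dropped when min_count is 2.
toy_corpus = [
    ["business", "news", "today"],
    ["news", "today"],
    ["news", "weather"],
]
print(sorted(surviving_vocabulary(toy_corpus, min_count=2)))
```

With min_count=20, as in the question, a word needs at least 20 occurrences in the tokenized training data to be kept, which a small dataset may not provide.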

As you don't show the type/contents of your raw_documents variable, or code for your TokenGenerator class, it's not clear why this would have gone wrong – but those are the places to look. Double-check that raw_documents has the right contents, and that individual items inside the docgen iterable-object look like the right sort of input for Word2Vec.

Each item in the docgen iterable object should be a list of string tokens, not a plain string or anything else. The docgen iterable must also be capable of being iterated over multiple times. For example, if you execute the following two lines, you should see the same list of string tokens printed twice (looking something like ['hello', 'world']):

print(next(iter(docgen)))
print(next(iter(docgen)))

If you see plain strings, docgen isn't providing the right kind of items for Word2Vec. If the second line prints a different item than the first, or fails because nothing is left, docgen is likely a simple single-pass iterator rather than an iterable object.
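Both checks can be combined into one small helper. This is a rough sketch assuming Python 3; `describe_corpus` and the `TokenLists` class are hypothetical names and are not the `TokenGenerator` from the question. The restartability test is a heuristic: it only compares the first item of two separate passes.

```python
def describe_corpus(corpus):
    """Report whether corpus yields token lists and can be iterated more than once."""
    first_pass = next(iter(corpus), None)
    second_pass = next(iter(corpus), None)
    return {
        "yields_token_lists": isinstance(first_pass, list)
        and all(isinstance(tok, str) for tok in first_pass),
        "restartable": first_pass is not None and first_pass == second_pass,
    }


class TokenLists:
    """Hypothetical restartable iterable: every __iter__ call starts a fresh pass."""

    def __init__(self, documents):
        self.documents = documents

    def __iter__(self):
        for doc in self.documents:
            yield doc.lower().split()


docs = ["Hello world", "Business news"]
restartable = TokenLists(docs)
single_pass = (doc.lower().split() for doc in docs)  # a generator can only be consumed once
plain_strings = docs  # restartable, but the items are strings rather than token lists

print(describe_corpus(restartable))
print(describe_corpus(single_pass))
print(describe_corpus(plain_strings))
```

Only the first object passes both checks; the generator fails the restartable check and the list of strings fails the token-list check.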

You could also enable logging at the INFO level and watch the output during the Word2Vec step carefully, and pay extra attention to any numbers/steps that seem incongruous. (For example, do any steps indicate nothing is happening, or do the counts of words/text-examples seem off?)
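Logging can be switched on with Python's standard logging module before the model is built. A minimal setup, assuming nothing else in the script has configured logging in a conflicting way; the format string is just one reasonable choice:

```python
import logging

# Send INFO-level (and higher) messages to the console with a timestamp.
logging.basicConfig(
    format="%(asctime)s : %(levelname)s : %(message)s",
    level=logging.INFO,
)

# basicConfig does nothing if the root logger already has handlers, so the level is
# also set on the "gensim" logger, which Gensim's module loggers should inherit from.
logging.getLogger("gensim").setLevel(logging.INFO)
```

Run this before the gensim.models.Word2Vec(...) call, then compare the reported counts of texts and words against what you expect from your dataset.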
