Reputation: 363
I have this code and a list of articles as my dataset. Each row contains one article.
I run this code:
import gensim
docgen = TokenGenerator( raw_documents, custom_stop_words )
# the model has 500 dimensions, the minimum document-term frequency is 20
w2v_model = gensim.models.Word2Vec(docgen, size=500, min_count=20, sg=1)
print( "Model has %d terms" % len(w2v_model.wv.vocab) )
w2v_model.save("w2v-model.bin")
# To re-load this model, run
#w2v_model = gensim.models.Word2Vec.load("w2v-model.bin")
def calculate_coherence( w2v_model, term_rankings ):
    overall_coherence = 0.0
    for topic_index in range(len(term_rankings)):
        # check each pair of terms
        pair_scores = []
        for pair in combinations(term_rankings[topic_index], 2 ):
            pair_scores.append( w2v_model.similarity(pair[0], pair[1]) )
        # get the mean for all pairs in this topic
        topic_score = sum(pair_scores) / len(pair_scores)
        overall_coherence += topic_score
    # get the mean score across all topics
    return overall_coherence / len(term_rankings)
import numpy as np
def get_descriptor( all_terms, H, topic_index, top ):
    # reverse sort the values to sort the indices
    top_indices = np.argsort( H[topic_index,:] )[::-1]
    # now get the terms corresponding to the top-ranked indices
    top_terms = []
    for term_index in top_indices[0:top]:
        top_terms.append( all_terms[term_index] )
    return top_terms
from itertools import combinations
k_values = []
coherences = []
for (k,W,H) in topic_models:
    # Get all of the topic descriptors - the term_rankings, based on top 10 terms
    term_rankings = []
    for topic_index in range(k):
        term_rankings.append( get_descriptor( terms, H, topic_index, 10 ) )
    # Now calculate the coherence based on our Word2vec model
    k_values.append( k )
    coherences.append( calculate_coherence( w2v_model, term_rankings ) )
    print("K=%02d: Coherence=%.4f" % ( k, coherences[-1] ) )
I get this error:
raise KeyError("word '%s' not in vocabulary" % word)
KeyError: u"word 'business' not in vocabulary"
The original code works great with their data set.
https://github.com/derekgreene/topic-model-tutorial
Could you help me understand what causes this error?
Upvotes: 1
Views: 1601
Reputation: 54173
It could help answerers if you included more of the information around the error message, such as the multiple-lines of call-frames that will clearly indicate which line of your code triggered the error.
However, if you receive the error KeyError: u"word 'business' not in vocabulary", you can trust that your Word2Vec instance, w2v_model, never learned the word 'business'.
This might be because it didn't appear in the training data the model was presented, or perhaps appeared but fewer than min_count times.
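One way to check this is to count how often the term occurs in the tokenized documents and compare that count with min_count. The following is only a sketch using the standard library: tokenized_docs is a made-up toy corpus standing in for whatever docgen yields, and term_frequency_report is a helper name invented for illustration, not part of gensim.

```python
from collections import Counter

def term_frequency_report(tokenized_docs, terms, min_count):
    """Return {term: (count, kept)} for each term of interest.

    `kept` is True when the term occurs at least `min_count` times,
    i.e. it would survive Word2Vec's min_count vocabulary cutoff.
    """
    counts = Counter(token for doc in tokenized_docs for token in doc)
    return {t: (counts[t], counts[t] >= min_count) for t in terms}

# Hypothetical toy corpus standing in for the items yielded by docgen
tokenized_docs = [
    ["business", "news", "market"],
    ["market", "news"],
    ["market", "report"],
]

report = term_frequency_report(tokenized_docs, ["business", "market"], min_count=2)
for term, (count, kept) in report.items():
    print("%s: count=%d, kept=%s" % (term, count, kept))
```

Running the same count over the real docgen output with min_count=20 should show whether 'business' is absent from the data or just too rare to be kept.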
As you don't show the type/contents of your raw_documents variable, or the code for your TokenGenerator class, it's not clear why this would have gone wrong, but those are the places to look. Double-check that raw_documents has the right contents, and that individual items inside the docgen iterable object look like the right sort of input for Word2Vec.
Each item in the docgen iterable object should be a list of string tokens, not a plain string or anything else. The docgen iterable must also be capable of being iterated over multiple times. For example, if you execute the following two lines, you should see the same list of string tokens printed twice (looking something like ['hello', 'world']):
print(next(iter(docgen)))
print(next(iter(docgen)))
If you see plain strings, docgen isn't providing the right kind of items for Word2Vec. If the two lines print different items, or the second one raises StopIteration, docgen is likely a simple single-pass iterator rather than an iterable object.
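To make the difference concrete, here is a minimal sketch contrasting a re-iterable object with a single-pass generator. ReiterableTokens is a hypothetical class written for this illustration (it is not the asker's TokenGenerator), and the lowercase-and-split tokenization is only a placeholder.

```python
class ReiterableTokens:
    """Hypothetical re-iterable wrapper: each iter() call starts a fresh pass."""

    def __init__(self, documents):
        self.documents = documents

    def __iter__(self):
        for doc in self.documents:
            # yield a list of string tokens, one list per document
            yield doc.lower().split()

docs = ["Hello world", "Second document here"]

# Re-iterable object: both calls restart from the first document
reiterable = ReiterableTokens(docs)
first_a = next(iter(reiterable))
first_b = next(iter(reiterable))
print(first_a, first_b)

# Plain generator: iter() returns the same exhausted-as-you-go object
single_pass = (doc.lower().split() for doc in docs)
gen_a = next(iter(single_pass))
gen_b = next(iter(single_pass))
print(gen_a, gen_b)
```

Word2Vec makes one pass over the corpus to build the vocabulary and further passes to train, so a single-pass generator would be used up after the first pass.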
You could also enable logging at the INFO level and watch the output during the Word2Vec step carefully, paying extra attention to any numbers/steps that seem incongruous. (For example, do any steps indicate nothing is happening, or do the counts of words/text-examples seem off?)
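A minimal way to turn that logging on, using Python's standard logging module, might look like the sketch below. It assumes gensim's module loggers live under the "gensim" logger name; the format string is just one reasonable choice.

```python
import logging

# Print timestamped INFO-level messages to the console
logging.basicConfig(
    format="%(asctime)s : %(levelname)s : %(message)s",
    level=logging.INFO,
)

# Set the level explicitly for gensim's loggers as well, in case the
# root logger was already configured before basicConfig was called
gensim_logger = logging.getLogger("gensim")
gensim_logger.setLevel(logging.INFO)
```

Run this before constructing the Word2Vec model so that the vocabulary-building and training messages are captured.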
Upvotes: 2