olbinado11

Reputation: 161

Error in Computing the Coherence Score – AttributeError: 'dict' object has no attribute 'id2token'

I am a beginner in NLP and this is my first time doing topic modeling. I was able to train my model, but I cannot compute the coherence metric.

First, I convert the term-document matrix into gensim's format: DataFrame --> sparse matrix --> gensim corpus.

import scipy.sparse
from gensim import matutils, models

sparse_counts = scipy.sparse.csr_matrix(data_dtm)
corpus = matutils.Sparse2Corpus(sparse_counts)
corpus

df_lemmatized.head()

# Gensim also requires a dictionary of all terms and their respective positions in the term-document matrix
import pickle

with open("tfidf.pkl", "rb") as f:
    tfidfv = pickle.load(f)
id2word = {v: k for k, v in tfidfv.vocabulary_.items()}
id2word
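
For illustration, the inversion above behaves like this on a toy vocabulary. The tokens and indices below are made up and only stand in for tfidfv.vocabulary_; the result is still a plain Python dict.

```python
# Toy stand-in for a fitted vectorizer's `vocabulary_` (token -> column index).
# These three entries are invented for illustration only.
vocabulary = {"topic": 2, "model": 1, "coherence": 0}

# Invert to column index -> token, the shape passed to LdaModel as `id2word`.
id2word = {idx: token for token, idx in vocabulary.items()}

# The inversion is lossless only if every column index is unique.
assert len(id2word) == len(vocabulary)

# The result is an ordinary dict, not a gensim Dictionary object.
print(type(id2word).__name__, id2word[0])
```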

This is my model:

lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=15, passes=10, random_state=43)
lda.print_topics()

Finally, here is where I tried to compute the coherence score using CoherenceModel:

from gensim.models import CoherenceModel

# Compute Perplexity
print('\nPerplexity: ', lda.log_perplexity(corpus))

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda, texts=df_lemmatized.long_title, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)

This is the error:

---> 57     if not dictionary.id2token:  # may not be initialized in the standard gensim.corpora.Dictionary
     58         setattr(dictionary, 'id2token', {v: k for k, v in dictionary.token2id.items()})
     59

AttributeError: 'dict' object has no attribute 'id2token'

Upvotes: 2

Views: 2706

Answers (1)

Anwarvic

Reputation: 12992

I don't have your data, so I can't reproduce the error, but here is my guess: the problem is your id2word. CoherenceModel expects a corpora.dictionary.Dictionary, not a plain dict, because it reads the dictionary's id2token and token2id attributes. Try the following:

>>> from gensim import corpora
>>> from gensim.models import CoherenceModel
>>>
>>> word2id = dict(tfidfv.vocabulary_)  # token -> id, the inverse of id2word
>>> d = corpora.Dictionary()
>>> d.id2token = id2word
>>> d.token2id = word2id
>>> #...
>>> # pass `d` instead of `id2word`
>>> coherence_model_lda = CoherenceModel(model=lda, texts=df_lemmatized.long_title, dictionary=d, coherence='c_v')

And I think it should work just fine now!
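
To make the cause concrete without needing gensim or your data, here is a minimal sketch. The function ensure_id2token only mirrors the two lines shown in the traceback and is not gensim's full code; the toy vocabulary and the SimpleNamespace stand-in are hypothetical and used purely for illustration.

```python
from types import SimpleNamespace


def ensure_id2token(dictionary):
    """Mirror of the check shown in the traceback (not gensim's full code)."""
    if not dictionary.id2token:
        dictionary.id2token = {v: k for k, v in dictionary.token2id.items()}
    return dictionary.id2token


# Toy vocabulary standing in for tfidfv.vocabulary_ (token -> id); invented values.
word2id = {"coherence": 0, "model": 1, "topic": 2}
id2word = {idx: token for token, idx in word2id.items()}

# A plain dict has keys, not an `id2token` attribute, so the check raises.
try:
    ensure_id2token(id2word)
    plain_dict_ok = True
except AttributeError:
    plain_dict_ok = False

# Hypothetical stand-in for a Dictionary: an object exposing both attributes passes,
# and an empty id2token is rebuilt from token2id.
stand_in = SimpleNamespace(token2id=word2id, id2token={})
recovered = ensure_id2token(stand_in)

print(plain_dict_ok)
print(recovered == id2word)
```

This is why assigning both id2token and token2id on a Dictionary object gets past that line. gensim also provides corpora.Dictionary.from_corpus(corpus, id2word=id2word), which may be a cleaner way to build the object than setting attributes by hand, though I have not tried it on your data.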

Upvotes: 4