Reputation: 9844
I'm trying to calculate the coherence score after using BERTopic to discover topics in an input text, but I'm running into this error and I'm not sure why:
"unable to interpret topic as either a list of tokens or a list of ids"
This is how I get the tokens and the topic words:
from bertopic import BERTopic
import gensim.corpora as corpora
from gensim.models.coherencemodel import CoherenceModel
topic_model = BERTopic(n_gram_range=(2, 3), min_topic_size=5)
topics, _ = topic_model.fit_transform(docs)
cleaned_docs = topic_model._preprocess_text(docs)
vectorizer = topic_model.vectorizer_model
analyzer = vectorizer.build_analyzer()
tokens = [analyzer(doc) for doc in cleaned_docs]
dictionary = corpora.Dictionary(tokens)
corpus = [dictionary.doc2bow(token) for token in tokens]
topics = topic_model.get_topics()
topics.pop(-1, None)
topic_words = [
    [word for word, _ in topic_model.get_topic(topic) if word != ""]
    for topic in topics
]
topic_words = [[words for words, _ in topic_model.get_topic(topic)]
               for topic in range(len(set(topics)) - 1)]
# Evaluate
coherence_model = CoherenceModel(topics=topic_words,
                                 texts=tokens,
                                 corpus=corpus,
                                 dictionary=dictionary,
                                 coherence='c_v')
coherence = coherence_model.get_coherence()
It fails here:
def _ensure_elements_are_ids(self, topic):
    ids_from_tokens = [self.dictionary.token2id[t] for t in topic if t in self.dictionary.token2id]
    ids_from_ids = [i for i in topic if i in self.dictionary]
    if len(ids_from_tokens) > len(ids_from_ids):
        return np.array(ids_from_tokens)
    elif len(ids_from_ids) > len(ids_from_tokens):
        return np.array(ids_from_ids)
    else:
        raise ValueError('unable to interpret topic as either a list of tokens or a list of ids')
It seems that something weird is happening in the topic_words step: I'm getting words that don't exist in the data, and I don't understand why. To test it, I manually set docs to a fixed list of strings.
For instance:
logger.info(id2word.token2id[t])
KeyError: 'calendar happy'
There is no entry for 'calendar happy' anywhere in docs, but I do see it when I log the topic words:
logger.info(topic_words)
[['new year', 'chinese new year', 'chinese new', 'calendar happy', '2023 chinese new', '2023 chinese', 'new month', 'happy new month', 'happy new year', 'monthly calendar happy']...
I'm not sure how this can happen, and this seems to be how people usually evaluate BERTopic, for instance: https://www.theanalyticslab.nl/topic-modeling-with-bertopic/
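To narrow this down, a quick check along these lines (a rough sketch reusing the dictionary and topic_words variables from the code above) should list every topic word that never made it into the gensim dictionary built from tokens:

# List topic words that are absent from the gensim dictionary's vocabulary.
missing = {
    word
    for words in topic_words
    for word in words
    if word not in dictionary.token2id
}
logger.info("Topic words missing from the dictionary: %s", missing)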
Upvotes: 3
Views: 3363
Reputation: 243
I ran into the same error; in my case it was caused by empty topic words. Some topics can end up with an empty list of top-N words for various reasons. Deleting those empty topics solved the problem for me.
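A minimal sketch of that fix, reusing the topic_words, tokens, corpus, and dictionary variables from the question (per the traceback above, the error is also raised when none of a topic's words exist in the dictionary, so dropping those words first can help as well):

# Drop words the gensim dictionary has never seen, then drop any topic
# that is left with no words, so CoherenceModel never receives an empty topic.
filtered_topic_words = [
    [word for word in words if word in dictionary.token2id]
    for words in topic_words
]
filtered_topic_words = [words for words in filtered_topic_words if len(words) > 0]

coherence_model = CoherenceModel(topics=filtered_topic_words,
                                 texts=tokens,
                                 corpus=corpus,
                                 dictionary=dictionary,
                                 coherence='c_v')
coherence = coherence_model.get_coherence()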
Upvotes: 2