pceccon

Reputation: 9844

Topic Modelling Coherence Score:

I'm trying to calculate the coherence score after using BERTopic to discover topics in an input text. However, I get the error "unable to interpret topic as either a list of tokens or a list of ids", and I'm not sure why.

This is how I get the tokens and topic words:

    from bertopic import BERTopic
    import gensim.corpora as corpora
    from gensim.models.coherencemodel import CoherenceModel

    topic_model = BERTopic(n_gram_range=(2, 3), min_topic_size=5)
    topics, _ = topic_model.fit_transform(docs)
    cleaned_docs = topic_model._preprocess_text(docs)
    vectorizer = topic_model.vectorizer_model
    analyzer = vectorizer.build_analyzer()
    tokens = [analyzer(doc) for doc in cleaned_docs]
    dictionary = corpora.Dictionary(tokens)
    corpus = [dictionary.doc2bow(token) for token in tokens]
    topics = topic_model.get_topics()
    topics.pop(-1, None)
    topic_words = [
        [word for word, _ in topic_model.get_topic(topic) if word != ""]
        for topic in topics
    ]
    topic_words = [
        [words for words, _ in topic_model.get_topic(topic)]
        for topic in range(len(set(topics)) - 1)
    ]

    # Evaluate
    coherence_model = CoherenceModel(
        topics=topic_words,
        texts=tokens,
        corpus=corpus,
        dictionary=dictionary,
        coherence='c_v',
    )
    coherence = coherence_model.get_coherence()

It fails here:

    def _ensure_elements_are_ids(self, topic):
        ids_from_tokens = [self.dictionary.token2id[t] for t in topic if t in self.dictionary.token2id]
        ids_from_ids = [i for i in topic if i in self.dictionary]
        if len(ids_from_tokens) > len(ids_from_ids):
            return np.array(ids_from_tokens)
        elif len(ids_from_ids) > len(ids_from_tokens):
            return np.array(ids_from_ids)
        else:
            raise ValueError('unable to interpret topic as either a list of tokens or a list of ids')

It seems that something weird is happening in the topic_words step: I'm getting words that don't exist in the data, and I don't understand why. To test this, I manually set docs to a fixed list of strings.

For instance:

    logger.info(id2word.token2id[t])
    KeyError: 'calendar happy'

I don't have any entry for 'calendar happy' in docs, but I see it when I log the topic words:

    logger.info(topic_words)
    [['new year', 'chinese new year', 'chinese new', 'calendar happy', '2023 chinese new', '2023 chinese', 'new month', 'happy new month', 'happy new year', 'monthly calendar happy']...

I'm not sure how this can happen, since this is how people usually evaluate BERTopic, for instance: https://www.theanalyticslab.nl/topic-modeling-with-bertopic/

Upvotes: 3

Views: 3363

Answers (1)

lovemyday

Reputation: 243

I ran into the same error, and in my case it was caused by empty topic words. Some topics can end up with empty top-N words for some reason. Deleting those empty topics before computing coherence solved the problem for me.
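A minimal sketch of that cleanup, in case it helps. The helper name `clean_topic_words` is hypothetical (it is not part of BERTopic or gensim), and it assumes `vocabulary` is any container that supports `in` for token strings, such as `dictionary.token2id` from the question. It goes one step further than my fix by also dropping words that are missing from the vocabulary, on the assumption that a topic with no known words fails the same check as an empty topic.

```python
def clean_topic_words(topic_words, vocabulary):
    """Return the topics with unusable words removed.

    Hypothetical helper, not part of BERTopic or gensim.
    A word is kept only if it is non-empty and present in `vocabulary`.
    A topic is kept only if at least one word survives.
    """
    cleaned = []
    for words in topic_words:
        kept = [word for word in words if word and word in vocabulary]
        if kept:  # skip topics that ended up with no usable words
            cleaned.append(kept)
    return cleaned


# Toy data: the second and third topics have no usable words.
example_vocabulary = {"new year": 0, "chinese new year": 1}
example_topics = [
    ["new year", "calendar happy"],
    ["", ""],
    ["calendar happy"],
]
example_cleaned = clean_topic_words(example_topics, example_vocabulary)
```

With the question's variables, this would be called as `topic_words = clean_topic_words(topic_words, dictionary.token2id)` right before building the `CoherenceModel`. Note that dropping topics changes how many topics the coherence score is averaged over, so it is worth logging how many were removed.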

Upvotes: 2
