poon

Reputation: 1

gensim CoherenceModel gives "ValueError: unable to interpret topic as either a list of tokens or a list of ids"

I was trying to tune the hyperparameters min_topic_size and top_n_words for my BERTopic topic models. I kept running into the error "ValueError: unable to interpret topic as either a list of tokens or a list of ids" when evaluating certain combinations of parameter values. Some pairs of values work fine, while others don't. For instance, with min_topic_size=20 and top_n_words=5 it failed to give a score, while other combinations of values worked. The text file I used is here: abs text file.

I have no clue what the issue is here.

from bertopic import BERTopic
from umap import UMAP
import gensim.corpora as corpora
from gensim.models.coherencemodel import CoherenceModel


umap_model = UMAP(n_neighbors=15, n_components=5, 
                  min_dist=0.5, metric='cosine', random_state=42)
abs=df.abstract.to_list()
yr=df.year.to_list()

#Hyperparametre tuning : top_n_words and min_topic_size 

def bert_coh(model,docs):
    score=[]
    cleaned_docs=model._preprocess_text(docs)
    vectorizer=model.vectorizer_model
    tokenizer = vectorizer.build_tokenizer()
    words = vectorizer.get_feature_names()
    tokens=[tokenizer(doc) for doc in cleaned_docs]
    dictionary =corpora.Dictionary(tokens)
    corpus=[dictionary.doc2bow(token) for token in tokens]
    topic_words = [[words for words, _ in model.get_topic(topic)] 
               for topic in range(len(set(topics))-1)]
    uci = CoherenceModel(topics=topic_words, 
                                 texts=tokens, 
                                 corpus=corpus,
                                 dictionary=dictionary, 
                                 coherence='c_uci')
    umass= CoherenceModel(topics=topic_words, 
                                 texts=tokens, 
                                 corpus=corpus,
                                 dictionary=dictionary, 
                                 coherence='u_mass')
    npmi = CoherenceModel(topics=topic_words, 
                                 texts=tokens, 
                                 corpus=corpus,
                                 dictionary=dictionary, 
                                 coherence='c_npmi')
    for obj in (uci,umass,npmi):
        coherence = obj.get_coherence()
        score.append(coherence)
    return score
#training model
#use abs at the abs text file 
model=BERTopic(top_n_words=5,umap_model=umap_model,min_topic_size=20,calculate_probabilities=True,
                          n_gram_range=(1,3),low_memory=True,verbose=True,language='multilingual')
topics, _ = model.fit_transform(abs)
bert_coh(model,abs)

Upvotes: 0

Views: 2448

Answers (1)

Dhaval Kanani

Reputation: 21

  • Use build_analyzer() instead of build_tokenizer(), which allows for n-gram tokenization

  • Preprocessing is now based on a collection of documents per topic, since the CountVectorizer was trained on that data
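For intuition: build_tokenizer() only splits a document into single tokens, while build_analyzer() also emits the n-grams the CountVectorizer was configured with (here, n_gram_range=(1, 3)). A rough pure-Python sketch of that difference — not sklearn's actual implementation, just an illustration:

```python
def analyze(doc, ngram_range=(1, 3)):
    # Sketch of what CountVectorizer's build_analyzer() produces:
    # unigrams up to trigrams, whereas build_tokenizer() would
    # return only the single tokens.
    tokens = doc.lower().split()
    lo, hi = ngram_range
    grams = []
    for n in range(lo, hi + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

print(analyze("topic models need evaluation"))
# Includes 'topic models' and 'topic models need' alongside the unigrams.
```

Because the BERTopic topic words can themselves be n-grams, tokenizing the texts with the analyzer keeps them findable in the gensim dictionary.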

     from bertopic import BERTopic
     import gensim.corpora as corpora
     from gensim.models.coherencemodel import CoherenceModel
    
     topic_model = BERTopic(verbose=True, n_gram_range=(1, 3))
     topics, _ = topic_model.fit_transform(docs)
    
     # Preprocess Documents
     import pandas as pd
     documents = pd.DataFrame({"Document": docs, "Topic": topics})
     documents_per_topic = documents.groupby(['Topic'], as_index=False).agg({'Document': ' '.join})
     cleaned_docs = topic_model._preprocess_text(documents_per_topic.Document.values)
    
     # Extract vectorizer and analyzer from BERTopic
     vectorizer = topic_model.vectorizer_model
     analyzer = vectorizer.build_analyzer()
    
     # Extract features for Topic Coherence evaluation
     words = vectorizer.get_feature_names()
     tokens = [analyzer(doc) for doc in cleaned_docs]
     dictionary = corpora.Dictionary(tokens)
     corpus = [dictionary.doc2bow(token) for token in tokens]
     topic_words = [[words for words, _ in topic_model.get_topic(topic)] 
                for topic in range(len(set(topics))-1)]
    
     # Evaluate
     coherence_model = CoherenceModel(topics=topic_words, 
                                  texts=tokens, 
                                  corpus=corpus,
                                  dictionary=dictionary, 
                                  coherence='c_v')
     coherence = coherence_model.get_coherence()
    
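A likely root cause of the original ValueError is that, for some hyperparameter combinations, one or more entries of topic_words ends up empty, or contains words missing from the gensim dictionary's vocabulary. A hedged sketch of a guard you could apply before building the CoherenceModel (here `vocab` is an illustrative stand-in for the dictionary's vocabulary):

```python
# Illustrative vocabulary standing in for dictionary.token2id.keys().
vocab = {"model", "topic", "word", "text", "data"}

topic_words = [
    ["model", "topic", "word"],
    [],                          # empty topic -> "unable to interpret topic" error
    ["data", "unseen-token"],    # token absent from the dictionary
]

# Keep only in-vocabulary words, then drop topics left empty.
filtered = [[w for w in topic if w in vocab] for topic in topic_words]
filtered = [topic for topic in filtered if topic]

print(filtered)  # [['model', 'topic', 'word'], ['data']]
```

Passing `filtered` instead of the raw `topic_words` to CoherenceModel avoids the error, at the cost of scoring fewer topics when some are degenerate.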

For more on coherence issues with topic models, refer to this link

Upvotes: 2
