Rizakha

Reputation: 131

Python LDA Gensim model with over 20 topics does not print properly

Using the Gensim package (both LDA and Mallet), I noticed that when I create a model with more than 20 topics and call print_topics, it prints at most 20 topics (not the first 20, but a seemingly arbitrary 20), and they are out of order.

So my question is: how do I get all of the topics to print? I am unsure whether this is a bug or an issue on my end. I have looked back through my library of LDA models (over 5,000, from different data sources), and this happens in every model with more than 20 topics.

Below is sample code with its output. In the output, the topics are not in order (they should be), and some topics are missing, such as topic 3.

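import gensim
from pprint import pprint

# jr_dict_corpus (bag-of-words corpus) and jr_dict (dictionary) are built earlier (not shown)
lda_model = gensim.models.ldamodel.LdaModel(corpus=jr_dict_corpus,
                                           id2word=jr_dict,
                                           num_topics=25,
                                           random_state=100,
                                           update_every=1,
                                           chunksize=100,
                                           passes=10,
                                           alpha='auto',
                                           per_word_topics=True)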

pprint(lda_model.print_topics())
# Note: if the model contained 20 topics, the topics would be listed in order 0-19
[(21,
  '0.001*"commitment" + 0.001*"study" + 0.001*"evolve" + 0.001*"outlook" + '
  '0.001*"value" + 0.001*"people" + 0.001*"individual" + 0.001*"client" + '
  '0.001*"structure" + 0.001*"proposal"'),
 (18,
  '0.001*"self" + 0.001*"insurance" + 0.001*"need" + 0.001*"trend" + '
  '0.001*"statistic" + 0.001*"propose" + 0.001*"analysis" + 0.001*"perform" + '
  '0.001*"impact" + 0.001*"awareness"'),
 (2,
  '0.001*"link" + 0.001*"task" + 0.001*"collegiate" + 0.001*"universitie" + '
  '0.001*"banking" + 0.001*"origination" + 0.001*"security" + 0.001*"standard" '
  '+ 0.001*"qualifications_bachelor" + 0.001*"greenfield"'),
 (11,
  '0.024*"collegiate" + 0.016*"interpersonal" + 0.016*"prepare" + '
  '0.016*"invite" + 0.016*"aspect" + 0.016*"college" + 0.016*"statistic" + '
  '0.016*"continent" + 0.016*"structure" + 0.016*"project"'),
 (10,
  '0.049*"enjoy" + 0.049*"ambiguity" + 0.017*"accordance" + 0.017*"liberalize" '
  '+ 0.017*"developing" + 0.017*"application" + 0.017*"vacancie" + '
  '0.017*"service" + 0.017*"initiative" + 0.017*"discontinuing"'),
 (20,
  '0.028*"negotiation" + 0.028*"desk" + 0.018*"enhance" + 0.018*"engage" + '
  '0.018*"discussion" + 0.018*"ability" + 0.018*"depth" + 0.018*"derive" + '
  '0.018*"enjoy" + 0.018*"balance"'),
 (12,
  '0.036*"individual" + 0.024*"validate" + 0.018*"greenfield" + '
  '0.018*"capability" + 0.018*"coordinate" + 0.018*"create" + '
  '0.018*"programming" + 0.018*"safety" + 0.010*"evaluation" + '
  '0.002*"reliability"'),
 (1,
  '0.028*"negotiation" + 0.021*"responsibility" + 0.014*"master" + '
  '0.014*"mind" + 0.014*"experience" + 0.014*"worker" + 0.014*"ability" + '
  '0.007*"summary" + 0.007*"proposal" + 0.007*"alert"'),
 (23,
  '0.043*"banking" + 0.026*"origination" + 0.026*"round" + 0.026*"credibility" '
  '+ 0.026*"entity" + 0.018*"standard" + 0.017*"range" + 0.017*"pension" + '
  '0.017*"adapt" + 0.017*"information"'),
 (13,
  '0.034*"priority" + 0.034*"reconciliation" + 0.034*"purchaser" + '
  '0.023*"reporting" + 0.023*"offer" + 0.023*"investor" + 0.023*"share" + '
  '0.023*"region" + 0.023*"service" + 0.023*"manipulate"'),
 (22,
  '0.017*"analyst" + 0.017*"modelling" + 0.016*"producer" + 0.016*"return" + '
  '0.016*"self" + 0.009*"scope" + 0.008*"mind" + 0.008*"need" + 0.008*"detail" '
  '+ 0.008*"statistic"'),
 (9,
  '0.021*"decision" + 0.014*"invite" + 0.014*"balance" + 0.014*"commercialize" '
  '+ 0.014*"transform" + 0.014*"manage" + 0.014*"optionality" + '
  '0.014*"problem_solving" + 0.014*"fuel" + 0.014*"stay"'),
 (7,
  '0.032*"commitment" + 0.032*"study" + 0.016*"impact" + 0.016*"outlook" + '
  '0.011*"operation" + 0.011*"expand" + 0.011*"exchange" + 0.011*"management" '
  '+ 0.011*"conde" + 0.011*"evolve"'),
 (15,
  '0.032*"agility" + 0.019*"feasibility" + 0.019*"self" + 0.014*"deploy" + '
  '0.014*"define" + 0.013*"investment" + 0.013*"option" + 0.013*"control" + '
  '0.013*"action" + 0.013*"incubation"'),
 (5,
  '0.020*"desk" + 0.018*"agility" + 0.016*"vender" + 0.016*"coordinate" + '
  '0.016*"committee" + 0.012*"acquisition" + 0.012*"target" + '
  '0.012*"counterparty" + 0.012*"approval" + 0.012*"trend"'),
 (17,
  '0.022*"option" + 0.017*"working" + 0.017*"niche" + 0.011*"business" + '
  '0.011*"constrain" + 0.011*"meeting" + 0.011*"correspond" + 0.011*"exposure" '
  '+ 0.011*"element" + 0.011*"face"'),
 (0,
  '0.025*"expertise" + 0.025*"banking" + 0.021*"universitie" + '
  '0.017*"spreadsheet" + 0.013*"negotiation" + 0.013*"shipment" + '
  '0.013*"arise" + 0.013*"billing" + 0.013*"assistance" + 0.013*"sector"'),
 (4,
  '0.024*"provide" + 0.017*"consider" + 0.017*"allow" + 0.015*"outlook" + '
  '0.015*"value" + 0.015*"contract" + 0.012*"study" + 0.012*"technology" + '
  '0.012*"scenario" + 0.012*"indicator"'),
 (6,
  '0.058*"impulse" + 0.027*"shall" + 0.027*"shape" + 0.024*"marketer" + '
  '0.017*"availability" + 0.014*"determine" + 0.014*"load" + '
  '0.014*"constantly_change" + 0.014*"instrument" + 0.014*"interface"'),
 (19,
  '0.042*"task" + 0.038*"tariff" + 0.038*"recommend" + 0.024*"example" + '
  '0.023*"future" + 0.021*"people" + 0.021*"math" + 0.021*"capacity" + '
  '0.021*"spirit" + 0.020*"price"')]

Same model as above, but with 20 topics. Here the output is ordered by topic number and contains all of the topics.

lda_model = gensim.models.ldamodel.LdaModel(corpus=jr_dict_corpus,
                                           id2word=jr_dict,
                                           num_topics=20, 
                                           random_state=100,
                                           update_every=1,
                                           chunksize=100,
                                           passes=10,
                                           alpha='auto',
                                           per_word_topics=True)

pprint(lda_model.print_topics())

[(0,
  '0.031*"enjoy" + 0.031*"ambiguity" + 0.028*"accordance" + 0.016*"statistic" '
  '+ 0.016*"initiative" + 0.016*"service" + 0.016*"liberalize" + '
  '0.016*"application" + 0.011*"community" + 0.011*"identifie"'),
 (1,
  '0.016*"transformation" + 0.016*"negotiation" + 0.016*"community" + '
  '0.016*"clock" + 0.011*"marketer" + 0.011*"desk" + 0.011*"mandate" + '
  '0.011*"closing" + 0.011*"initiative" + 0.011*"experience"'),
 (2,
  '0.026*"priority" + 0.026*"reconciliation" + 0.026*"purchaser" + '
  '0.020*"safety" + 0.020*"region" + 0.020*"query" + 0.020*"share" + '
  '0.020*"manipulate" + 0.020*"ibex" + 0.020*"investor"'),
 (3,
  '0.022*"improve" + 0.021*"committee" + 0.021*"affect" + 0.012*"target" + '
  '0.012*"acquisition" + 0.011*"basis" + 0.011*"profitability" + '
  '0.011*"economic" + 0.011*"natural" + 0.011*"profit"'),
 (4,
  '0.024*"provide" + 0.019*"value" + 0.017*"consider" + 0.017*"allow" + '
  '0.015*"scenario" + 0.015*"outlook" + 0.015*"contract" + 0.014*"forecast" + '
  '0.014*"decision" + 0.012*"indicator"'),
 (5,
  '0.037*"desk" + 0.030*"coordinate" + 0.030*"agility" + 0.030*"vender" + '
  '0.023*"counterparty" + 0.023*"immature_emerge" + 0.023*"metric" + '
  '0.022*"approval" + 0.015*"maximization" + 0.015*"undergraduate"'),
 (6,
  '0.053*"impulse" + 0.025*"shall" + 0.025*"shape" + 0.018*"availability" + '
  '0.018*"marketer" + 0.012*"determine" + 0.012*"language" + '
  '0.012*"monitoring" + 0.012*"integration" + 0.012*"month"'),
 (7,
  '0.026*"commitment" + 0.026*"study" + 0.013*"impact" + 0.013*"outlook" + '
  '0.009*"operation" + 0.009*"management" + 0.009*"expand" + 0.009*"exchange" '
  '+ 0.009*"conde" + 0.009*"balance"'),
 (8,
  '0.057*"insurance" + 0.029*"propose" + 0.028*"rule" + 0.026*"self" + '
  '0.023*"product" + 0.023*"asset" + 0.023*"pricing" + 0.023*"amount" + '
  '0.023*"result" + 0.020*"liquidity"'),
 (9,
  '0.012*"universitie" + 0.012*"need" + 0.012*"statistic" + 0.012*"trend" + '
  '0.008*"invite" + 0.008*"commercialize" + 0.008*"transform" + 0.008*"manage" '
  '+ 0.008*"problem_solving" + 0.008*"optionality"'),
 (10,
  '0.024*"background" + 0.024*"curve" + 0.020*"allow" + 0.019*"collect" + '
  '0.019*"basis" + 0.017*"accordance" + 0.013*"improve" + 0.013*"datum" + '
  '0.013*"component" + 0.013*"reliability"'),
 (11,
  '0.054*"task" + 0.049*"tariff" + 0.049*"recommend" + 0.031*"future" + '
  '0.027*"spirit" + 0.027*"capacity" + 0.027*"math" + 0.022*"ensure" + '
  '0.022*"profit" + 0.022*"variable_margin"'),
 (12,
  '0.001*"impulse" + 0.001*"availability" + 0.001*"reliability" + '
  '0.001*"shall" + 0.001*"component" + 0.001*"agent" + 0.001*"marketer" + '
  '0.001*"shape" + 0.001*"assisting" + 0.001*"supply"'),
 (13,
  '0.021*"region" + 0.016*"greenfield" + 0.016*"collegiate" + 0.011*"transfer" '
  '+ 0.011*"remuneration" + 0.011*"organization" + 0.011*"structure" + '
  '0.011*"continent" + 0.011*"project" + 0.011*"prepare"'),
 (14,
  '0.033*"originator" + 0.025*"vender" + 0.025*"expertise" + 0.025*"banking" + '
  '0.019*"evolve" + 0.017*"management" + 0.017*"market" + 0.017*"site" + '
  '0.012*"component" + 0.012*"discontinuing"'),
 (15,
  '0.027*"agility" + 0.022*"mind" + 0.022*"negotiation" + 0.011*"deploy" + '
  '0.011*"define" + 0.011*"ecosystem" + 0.011*"control" + 0.011*"lead" + '
  '0.011*"industry" + 0.011*"option"'),
 (16,
  '0.001*"region" + 0.001*"master" + 0.001*"orginiation" + 0.001*"greenfield" '
  '+ 0.001*"agent" + 0.001*"identifie" + 0.001*"remuneration" + 0.001*"mark" + '
  '0.001*"reviewing" + 0.001*"closing"'),
 (17,
  '0.030*"banking" + 0.018*"option" + 0.018*"round" + 0.018*"credibility" + '
  '0.018*"origination" + 0.018*"entity" + 0.016*"working" + 0.015*"niche" + '
  '0.015*"standard" + 0.012*"coordinate"'),
 (18,
  '0.027*"negotiation" + 0.018*"reporting" + 0.018*"perform" + 0.018*"world" + '
  '0.015*"offer" + 0.015*"manipulate" + 0.011*"query" + 0.010*"control" + '
  '0.010*"working" + 0.009*"self"'),
 (19,
  '0.047*"example" + 0.039*"people" + 0.039*"price" + 0.039*"excel" + '
  '0.039*"excellent" + 0.038*"base" + 0.031*"office" + 0.031*"optimizing" + '
  '0.031*"participate" + 0.031*"package"')]

Upvotes: 2

Views: 1404

Answers (2)

user14462764

Reputation: 153

print(lda_model.print_topics(num_topics=25, num_words=10))

Upvotes: 0

Rizakha

Reputation: 131

The default value of num_topics in print_topics is 20. To print more than 20 topics, you must pass the num_topics argument explicitly...
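As for why the 20 topics come back out of order, my understanding from reading the gensim source (which may differ between versions) is that when fewer topics are requested than the model has, the topics are ranked by their alpha prior with a little random jitter, and roughly the lowest half and highest half of the requested count are returned. The pure-Python sketch below imitates that idea; select_topics and the alpha values are hypothetical stand-ins, not gensim's API or real model output.

```python
import random


def select_topics(alpha, num_topics=20, seed=100):
    """Hypothetical stand-in for the topic selection step (not gensim code)."""
    total = len(alpha)
    # Asking for all topics (or a negative count) returns them in id order.
    if num_topics < 0 or num_topics >= total:
        return list(range(total))
    # Otherwise rank topics by alpha plus a small random jitter.
    rng = random.Random(seed)
    jittered = [a + 0.0001 * rng.random() for a in alpha]
    order = sorted(range(total), key=lambda i: jittered[i])
    # Keep the lowest-ranked and highest-ranked topics, drop the middle.
    low_count = num_topics // 2
    high_count = num_topics - low_count
    return order[:low_count] + order[total - high_count:]


# Made-up alpha values for a 25-topic model (all distinct, in scrambled order).
alpha = [((i * 7) % 25 + 1) / 100 for i in range(25)]

subset = select_topics(alpha)                    # default of 20: unordered subset
everything = select_topics(alpha, num_topics=25)  # all topics, in id order

print(subset)
print(everything)
```

Under this reading, the missing topics (such as topic 3 in the question) are simply the ones whose alpha falls in the middle of the ranking.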

Upvotes: 2
