Reputation: 51
I have 2 files, music.txt and science.txt.
I'd like to extract 2 topics from them (Music, Science).
I created the LDA model from these 2 files (setting num_topics=2):
import gensim
lda = gensim.models.ldamodel.LdaModel(corpus=my_corpus, id2word=corpus_dictionary, num_topics=2)
print(lda.print_topic(0))
print(lda.print_topic(1))
This is my output:
0.011*scientific + 0.010*musical + 0.007*music, + 0.006*music. + 0.006*study + 0.005*not + 0.005*research + 0.005*main
0.030*music + 0.013*science + 0.010*scientific + 0.009*musical + 0.006*not + 0.005*music. + 0.005*study + 0.005*music, + 0.005*their + 0.005*research
As you can see, both science and music are present in each of the 2 topics.
I'd like to
Is the above 3rd step possible? I'd like the topics in my LDA model to be cleanly segregated. If not, is there any alternative?
Upvotes: 1
Views: 371
Reputation: 3099
There are two things you can do:
1) If your documents really contain texts that are exclusively about music or science, it is strange that the LDA topics give such a mixed result. Trying to improve the model may be worthwhile. You may consider dropping stopwords, ignoring low-frequency words, and so on.
2) However, the method that you're really looking for is so-called Labeled LDA. With Labeled LDA, you train your model on documents that have already been labeled with the target topics, rather than having the model infer the most appropriate topics itself. As far as I know, Labeled LDA has not been implemented in gensim, but you'll find it in the Stanford Topic Modeling Toolkit, among other places.
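For point 1, here is a minimal sketch of the kind of cleanup I mean. The tokens "music," and "music." in your output suggest the text was split on whitespace without stripping punctuation, so the sketch normalizes tokens first, then drops stopwords and rare words. The names STOPWORDS, tokenize, preprocess, train_lda and the min_count threshold are made up for illustration, and the stopword list is deliberately tiny; the gensim calls (corpora.Dictionary, doc2bow, models.LdaModel) are the usual ones, but treat the parameter values as guesses to tune.

```python
import re
from collections import Counter

# Illustrative stopword list only (an assumption); use a fuller list in practice.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "are",
             "not", "their", "it", "that", "this", "for", "on", "with", "as", "by"}


def tokenize(text):
    """Lowercase and keep alphabetic runs, so 'music,' and 'music.' both become 'music'."""
    return re.findall(r"[a-z]+", text.lower())


def preprocess(documents, min_count=2):
    """Tokenize, drop stopwords, then drop words seen fewer than min_count times overall."""
    tokenized = [[t for t in tokenize(doc) if t not in STOPWORDS] for doc in documents]
    counts = Counter(t for doc in tokenized for t in doc)
    return [[t for t in doc if counts[t] >= min_count] for doc in tokenized]


def train_lda(token_lists, num_topics=2):
    """Build the dictionary and bag-of-words corpus, then fit LDA (requires gensim)."""
    from gensim import corpora, models  # imported here so the helpers above work without gensim

    dictionary = corpora.Dictionary(token_lists)
    corpus = [dictionary.doc2bow(tokens) for tokens in token_lists]
    # passes and random_state are arbitrary choices for a small, repeatable run.
    return models.LdaModel(corpus=corpus, id2word=dictionary,
                           num_topics=num_topics, passes=20, random_state=0)
```

You would call it roughly as train_lda(preprocess([open("music.txt").read(), open("science.txt").read()])) and then print the topics as before. Even with clean tokens, words shared by both files can still show up in both topics, which is why point 2 may be the better fit.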
Upvotes: 1