wolfgang5

Reputation: 51

Text Processing, How to assign 1 topic -> 1 document using LDA?

I have 2 files,

music.txt & science.txt

I'd like to extract 2 topics from the above (Music, Science).

After creating the LDA model from these 2 files (setting num_topics=2)

lda = gensim.models.ldamodel.LdaModel(corpus=my_corpus, id2word=corpus_dictionary, num_topics=2)

print(lda.print_topic(0))
print(lda.print_topic(1))

This is my output

0.011*scientific + 0.010*musical + 0.007*music, + 0.006*music. + 0.006*study + 0.005*not + 0.005*research + 0.005*main

0.030*music + 0.013*science + 0.010*scientific + 0.009*musical + 0.006*not + 0.005*music. + 0.005*study + 0.005*music, + 0.005*their + 0.005*research

As you can see, both science and music terms appear in each of the 2 topics.

I'd like to

  1. Use music.txt and create 1 topic Music LDA model
  2. Use science.txt and create 1 topic Science LDA model
  3. Combine the above 2 LDA models to give 1 LDA model with the above 2 topics

Is the 3rd step above possible? I'd like the topics in my LDA model to be cleanly segregated. If not, is there an alternative?

Upvotes: 1

Views: 371

Answers (1)

yvespeirsman

Reputation: 3099

There are two things you can do:

1) If your documents really contain texts that are exclusively about music or science, it is strange that the LDA topics give such a mixed result. Trying to improve the model may be worthwhile: for example, drop stopwords, ignore low-frequency words, and strip punctuation so that tokens such as "music," and "music." in your output are counted together with "music".

2) However, the method that you're really looking for is so-called Labeled LDA. With Labeled LDA, you train your model on documents that have already been labeled with the target topics, rather than having the model infer the most appropriate topics itself. As far as I know, Labeled LDA has not been implemented in gensim, but you'll find it in the Stanford Topic Modeling Toolkit, among other places.
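
The cleanup suggested in point 1 could look roughly like the sketch below. It uses only the standard library; the stopword list, the `min_count` threshold, the helper names `tokenize` and `preprocess`, and the two sample strings are all illustrative assumptions, not something taken from your files.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; use a fuller list in practice).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "not", "their", "main"}

def tokenize(text):
    """Lowercase and keep only alphabetic runs, so 'music,' and 'music.' become 'music'."""
    return re.findall(r"[a-z]+", text.lower())

def preprocess(raw_docs, min_count=2):
    """Tokenize each document, drop stopwords, then drop words seen fewer than min_count times overall."""
    tokenized = [[t for t in tokenize(doc) if t not in STOPWORDS] for doc in raw_docs]
    counts = Counter(t for doc in tokenized for t in doc)
    return [[t for t in doc if counts[t] >= min_count] for doc in tokenized]

# Hypothetical stand-ins for the contents of music.txt and science.txt.
docs = [
    "Music, music. The musical study of music is not rare.",
    "Science and scientific research: the study of science.",
]
texts = preprocess(docs)
```

The resulting token lists can then be turned into a gensim corpus in the usual way, e.g. with `gensim.corpora.Dictionary(texts)` and `dictionary.doc2bow(text)` for each document. gensim's `Dictionary.filter_extremes` is another way to prune rare or overly common words.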
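
To illustrate the idea behind point 2 (topics fixed by labels instead of inferred), here is a much simpler stand-in. It is not Labeled LDA itself: it builds one smoothed word distribution per labeled file and assigns new text to the label that explains it best. This is close to what you describe in your steps 1 to 3 when every document has exactly one label. The function names, the smoothing value `alpha`, and the sample strings are assumptions made for the example.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and keep only alphabetic runs."""
    return re.findall(r"[a-z]+", text.lower())

def train(labeled_docs, alpha=1.0):
    """Build one add-alpha smoothed word distribution per label over a shared vocabulary."""
    counts = {label: Counter(tokenize(text)) for label, text in labeled_docs.items()}
    vocab = set().union(*counts.values())
    model = {}
    for label, c in counts.items():
        total = sum(c.values()) + alpha * len(vocab)
        model[label] = {w: (c[w] + alpha) / total for w in vocab}
    return model

def classify(model, text):
    """Return the label whose word distribution gives the text the highest log-likelihood.

    Words outside the training vocabulary are ignored.
    """
    tokens = tokenize(text)
    scores = {
        label: sum(math.log(dist[w]) for w in tokens if w in dist)
        for label, dist in model.items()
    }
    return max(scores, key=scores.get)

# Hypothetical stand-ins for the contents of music.txt and science.txt.
labeled_docs = {
    "music": "melody rhythm harmony melody song",
    "science": "experiment theory hypothesis experiment data",
}
model = train(labeled_docs)
```

With this toy data, `classify(model, "a song with a strong melody")` picks `"music"` and `classify(model, "the experiment produced data")` picks `"science"`. Each entry of `model` plays the role of one fixed topic, so the two distributions never mix the way the unsupervised topics in the question do.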

Upvotes: 1
