Reputation: 875
I am going through the gensim LDA implementation, and it says it needs both a corpus and a dictionary of that corpus:
https://radimrehurek.com/gensim/models/ldamodel.html
What is the reason for this?
Upvotes: 0
Views: 492
Reputation: 769
Gensim uses the dictionary, a mapping from each unique token to an integer id, to convert documents into the bag-of-words vectors that form the corpus.
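from gensim.corpora import Dictionary
from gensim.test.utils import common_texts  # small sample of tokenized documents
# Make the dictionary (token -> integer id) from your tokenized texts
common_dictionary = Dictionary(common_texts)
# Use the dictionary to generate the corpus (a list of bag-of-words vectors)
common_corpus = [common_dictionary.doc2bow(text) for text in common_texts]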
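To make the dependency concrete, here is a plain-Python sketch of roughly what the two steps above do. The names `build_dictionary`, `doc2bow_sketch`, and the toy texts are made up for illustration; gensim's real implementation differs in details such as how ids are assigned.

```python
from collections import Counter

def build_dictionary(texts):
    """Assign each distinct token an integer id, in order of first appearance."""
    token2id = {}
    for text in texts:
        for token in text:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def doc2bow_sketch(token2id, text):
    """Convert a tokenized document to sorted (token_id, count) pairs.

    Tokens that are missing from the dictionary are skipped.
    """
    counts = Counter(token2id[t] for t in text if t in token2id)
    return sorted(counts.items())

# Toy tokenized documents (illustrative only).
toy_texts = [
    ["human", "computer", "interface"],
    ["computer", "system", "computer"],
]
toy_dictionary = build_dictionary(toy_texts)
toy_corpus = [doc2bow_sketch(toy_dictionary, text) for text in toy_texts]

print(toy_dictionary)
print(toy_corpus)
```

Without the token-to-id mapping there is no way to produce the `(id, count)` pairs, which is why the dictionary has to exist before the corpus does.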
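You can then reuse the same dictionary to convert unseen texts into a new corpus that shares the same word ids. Words that are not in the dictionary are ignored.
# other_texts: new tokenized documents (lists of str), in the same format as common_texts
other_corpus = [common_dictionary.doc2bow(text) for text in other_texts]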
You need the dictionary in order to have the corpus: the corpus is made of documents converted to bag-of-words vectors, and building those vectors requires the dictionary's word-to-id mapping. Other implementations of the bag-of-words model (such as sklearn's CountVectorizer) build the dictionary for you behind the scenes, but it's still there (CountVectorizer exposes it as the fitted vocabulary_ attribute).
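The same mapping is also why the LDA model is handed the dictionary alongside the corpus: the corpus holds only integer ids, so something has to translate ids back into words when topics are displayed (in gensim this is the `id2word` argument of `LdaModel`). A minimal plain-Python sketch of that lookup follows; `toy_id2word`, `toy_topic`, `describe_topic`, and the weights are invented for illustration and are not output from a real model.

```python
# Hypothetical id -> word mapping, the inverse of a dictionary's token -> id map.
toy_id2word = {0: "human", 1: "computer", 2: "interface", 3: "system"}

# A made-up topic expressed as (token_id, weight) pairs.
toy_topic = [(1, 0.5), (3, 0.3), (0, 0.2)]

def describe_topic(topic, id2word, topn=2):
    """Return the topn highest-weighted words of a topic as readable text."""
    top = sorted(topic, key=lambda pair: pair[1], reverse=True)[:topn]
    return " + ".join(
        f"{weight:.2f}*{id2word[token_id]}" for token_id, weight in top
    )

print(describe_topic(toy_topic, toy_id2word))
```

Without the id-to-word lookup, the topic would only be readable as a list of numbers.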
Upvotes: 2