gython

Reputation: 875

Why does the LDA gensim implementation need the corpus and a dictionary?

I am going through the gensim LDA implementation, and the documentation says it needs both a corpus and a dictionary of the corpus.

https://radimrehurek.com/gensim/models/ldamodel.html

What is the reason for this?

Upvotes: 0

Views: 492

Answers (1)

Peritract

Reputation: 769

Gensim uses the dictionary to create the bag-of-words models that form the corpus.

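# Imports (common_texts is gensim's small built-in sample of tokenized documents)
from gensim.corpora import Dictionary
from gensim.test.utils import common_texts

# Make the dictionary from your texts
common_dictionary = Dictionary(common_texts)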

# Use the dictionary to generate the corpus (set of bag-of-words models)
common_corpus = [common_dictionary.doc2bow(text) for text in common_texts]

You can then reuse that dictionary to convert unseen texts into a new corpus that shares the same word ids. By default, doc2bow ignores words that are not in the dictionary.

# other_texts is assumed to be a list of tokenized documents, like common_texts
other_corpus = [common_dictionary.doc2bow(text) for text in other_texts]

You need the dictionary in order to have the corpus: the corpus is made of documents converted to bag-of-words, and building a bag-of-words requires a dictionary that maps each word to an id. Other implementations of the bag-of-words model (such as sklearn's CountVectorizer) hide the dictionary from you, but it's still there.
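To make the relationship concrete, here is a minimal pure-Python sketch of what a dictionary and a bag-of-words corpus amount to. It is not gensim's implementation: build_dictionary, the standalone doc2bow function, and the sample texts are hypothetical names chosen for illustration, and only the general idea (word to id mapping, then id/count pairs) is meant to carry over.

```python
from collections import Counter

def build_dictionary(texts):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for text in texts:
        for token in text:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def doc2bow(token2id, text):
    """Convert a tokenized document to sorted (token_id, count) pairs.

    Tokens missing from the dictionary are skipped (an assumption that
    mirrors the default behaviour described above).
    """
    counts = Counter(token2id[token] for token in text if token in token2id)
    return sorted(counts.items())

# Hypothetical sample data: two tokenized documents
texts = [["human", "computer", "interface"], ["computer", "system", "computer"]]

token2id = build_dictionary(texts)
corpus = [doc2bow(token2id, text) for text in texts]

# The corpus holds only ids and counts; the reverse mapping is what
# turns those ids back into readable words.
id2token = {token_id: token for token, token_id in token2id.items()}

# An unseen document is encoded with the same ids; "graph" is unknown and dropped
unseen_bow = doc2bow(token2id, ["computer", "graph", "human"])
```

The corpus on its own is just integers, which is likely why a topic model wants the dictionary as well: without the id to word mapping, the topics it learns could only be reported as lists of ids. In gensim, that mapping is passed to LdaModel through its id2word parameter.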

Upvotes: 2
