How to implement Latent Dirichlet Allocation to give bigrams/trigrams in topics instead of unigrams

Question

I used gensim's LdaModel to extract topics from customer reviews as follows:

import gensim
from gensim import corpora

# Build the vocabulary from the tokenized reviews
dictionary = corpora.Dictionary(clean_reviews)
dictionary.filter_extremes(keep_n=11000)  # adjust the filters as needed
dictionary.compactify()
dictionary_path = "dictionary.dict"
dictionary.save(dictionary_path)

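# Convert tokenized documents to bag-of-words vectors

corpus = [dictionary.doc2bow(doc) for doc in clean_reviews]
# Note: the next line comes from the separate `lda` package (needs `import lda`) and is not used below
vocab = lda.datasets.load_reuters_vocab()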

# Train LDA with the number of topics set to 20 (which can be changed)

lda = gensim.models.LdaModel(corpus, id2word=dictionary,
                             num_topics=20,
                             passes=20,
                             random_state=1,
                             alpha="auto")

This returns unigrams in topics like:

topic1 - delivery, parcel, location

topic2 - app, login, access

But I am looking for n-grams. I came across sklearn's LatentDirichletAllocation, which can be combined with a TfidfVectorizer as follows:

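from sklearn import decomposition
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(2, 5), stop_words='english', min_df=2)
X = vectorizer.fit_transform(new_review_list)
# n_topics was renamed to n_components in newer scikit-learn releases
clf = decomposition.LatentDirichletAllocation(n_components=20, random_state=3, doc_topic_prior=.1).fit(X)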

Here the n-gram range can be specified in the vectorizer. Is it possible to do the same with the gensim LDA model?

Sorry, I'm very new to all these models, so I don't know much about them.

Answers (1)
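
gensim's LdaModel only sees whatever tokens are in the bag-of-words corpus, so one way to get n-grams into the topics is to turn each review into n-gram tokens before building the dictionary. Below is a minimal sketch of that idea in plain Python, roughly mirroring sklearn's ngram_range. The helper name add_ngrams, its parameters, and the toy clean_reviews data are assumptions made for illustration; they are not part of gensim or of the original code.

```python
from typing import List


def add_ngrams(tokens: List[str], n_min: int = 2, n_max: int = 3,
               sep: str = "_") -> List[str]:
    """Return every contiguous n-gram of `tokens` with n_min <= n <= n_max.

    Each n-gram is joined into a single string with `sep`, so it can be
    treated as one ordinary token by a bag-of-words model.
    (Hypothetical helper, not a gensim function.)
    """
    ngrams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of length n over the token list.
        for i in range(len(tokens) - n + 1):
            ngrams.append(sep.join(tokens[i:i + n]))
    return ngrams


# Toy stand-in for the question's `clean_reviews` (a list of token lists).
clean_reviews = [
    ["parcel", "delivery", "location", "wrong"],
    ["app", "login", "failed"],
]

# One list of bigram/trigram tokens per review.
ngram_reviews = [add_ngrams(doc) for doc in clean_reviews]
print(ngram_reviews[1])  # ['app_login', 'login_failed', 'app_login_failed']
```

The resulting ngram_reviews can then be passed to corpora.Dictionary and dictionary.doc2bow in place of clean_reviews, and the rest of the training code stays the same. Keeping every possible n-gram makes the vocabulary grow quickly, so filter_extremes matters more here. If only frequently co-occurring phrases are wanted, gensim.models.Phrases is worth looking at as an alternative to exhaustive n-grams.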
