Reputation: 907
I used gensim's LdaModel to extract topics from customer reviews as follows:
import gensim
import lda  # separate `lda` package, only needed for the Reuters vocabulary below
from gensim import corpora

# build the vocabulary from the tokenized reviews
dictionary = corpora.Dictionary(clean_reviews)
dictionary.filter_extremes(keep_n=11000)  # change filters
dictionary.compactify()
dictionary_path = "dictionary.dict"
corpora.Dictionary.save(dictionary, dictionary_path)
# convert tokenized documents to vectors
corpus = [dictionary.doc2bow(doc) for doc in clean_reviews]
vocab = lda.datasets.load_reuters_vocab()
# Train LDA with the number of topics set to 20 (which can be changed)
lda_model = gensim.models.LdaModel(corpus, id2word=dictionary,
                                   num_topics=20,
                                   passes=20,
                                   random_state=1,
                                   alpha="auto")
This returns unigrams in topics like:
topic1 - delivery, parcel, location
topic2 - app, login, access
But I am looking for n-grams. I came across sklearn's LatentDirichletAllocation, which is used with a TfidfVectorizer as follows:
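from sklearn import decomposition
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(2, 5), stop_words='english', min_df=2)
X = vectorizer.fit_transform(new_review_list)
# n_components replaces the older n_topics argument
clf = decomposition.LatentDirichletAllocation(n_components=20, random_state=3, doc_topic_prior=0.1).fit(X)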
where we can specify the n-gram range in the vectorizer. Is it possible to do the same with the gensim LDA model?
Sorry, I'm very new to these models, so I don't know much about them.
Upvotes: 1
Views: 4203
Reputation: 11
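I know this is an old thread, but I thought I would share what I did to get k-grams in topics. I wanted to include bi-grams, tri-grams, and quad-grams in my vocabulary. For this purpose, I used gensim's Phrases class before running the LDA model. Phrases merges adjacent tokens that frequently occur together into a single token (joined with an underscore), so the dictionary and the resulting topics can contain multi-word terms. Here is a really good resource: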
https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/#15visualizethetopicskeywords
I have done something similar. Hope this helps.
Upvotes: 1