Reputation: 43
I have a collection of comments, each of which discusses a topic, and I want to find the top m topics discussed in them. I also receive these comments in an online fashion (i.e. I don't get all of the comments in one go; I have to process them one by one).

My idea was to use Word2Vec for feature extraction and then apply a clustering algorithm like k-means (each cluster would correspond to a topic), taking the answer from the top m clusters (those containing the most points). The problem is that I don't know the number of clusters, and the number of distinct topics (clusters) isn't fixed at any point in time, because a new comment might discuss a new topic. So the problem can't be solved simply by running k-means with different values of k.

Should I use some other clustering algorithm (like DBSCAN), and what would the approach be in that case, or should I use a totally different approach?
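One simple baseline for the online setting described above is incremental, threshold-based clustering: assign each incoming comment vector to the most similar existing cluster if the similarity clears a threshold, otherwise start a new cluster. Below is a minimal sketch in plain Python; it assumes each comment has already been turned into a fixed-length vector (e.g. by averaging its Word2Vec word vectors), and the `threshold` value and class name are illustrative, not from any library.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class OnlineTopicClusters:
    """Incrementally cluster comment vectors without fixing k in advance.

    A new vector joins the most similar existing cluster (by cosine
    similarity to its centroid) if the similarity >= threshold;
    otherwise it starts a new cluster, i.e. a new topic.
    """
    def __init__(self, threshold=0.8):
        self.threshold = threshold
        self.centroids = []  # running mean vector per cluster
        self.sizes = []      # number of comments per cluster

    def add(self, vec):
        """Assign one comment vector; return the cluster index."""
        best, best_sim = -1, -1.0
        for i, c in enumerate(self.centroids):
            sim = cosine(vec, c)
            if sim > best_sim:
                best, best_sim = i, sim
        if best >= 0 and best_sim >= self.threshold:
            # Update the running mean of the matched cluster.
            n = self.sizes[best]
            self.centroids[best] = [(c * n + v) / (n + 1)
                                    for c, v in zip(self.centroids[best], vec)]
            self.sizes[best] += 1
            return best
        # No cluster is similar enough: open a new topic.
        self.centroids.append(list(vec))
        self.sizes.append(1)
        return len(self.centroids) - 1

    def top_m(self, m):
        """Indices of the m largest clusters (most-discussed topics)."""
        order = sorted(range(len(self.sizes)), key=lambda i: -self.sizes[i])
        return order[:m]
```

This keeps memory per cluster constant and handles new topics naturally, at the cost of being sensitive to the threshold and to the order in which comments arrive.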
Upvotes: 0
Views: 62
Reputation: 870
Why not try something simple like LDA? Start with a large number of topics and then narrow it down: https://radimrehurek.com/gensim/models/ldamodel.html
On a similar note, you can take a look at sense2vec, where they used Reddit comments to build a topic model: https://explosion.ai/blog/sense2vec-with-spacy
Upvotes: 1