Prajwal

Reputation: 83

How to classify a sentence into one of a set of pre-defined topic buckets using an unsupervised approach

I am working on a project to classify customer feedback into buckets based on the topic of the feedback comment. So I need to classify each sentence into one topic from a list of pre-defined topics.

For example :

"I keep getting an error message every time I log in" has to be tagged with "login" as the topic.

"make the screen more colorful" has to be tagged with "improvements" as the topic.

So the topics are very specific to the product and the context.

LDA doesn't seem to work for me (correct me if I'm wrong). It detects topics in a general sense, like "Sports", "Politics", "Technology", etc., but I need to detect specific topics like the ones mentioned above.

Also, I don't have labelled data for training. All I have are the comments, so a supervised learning approach doesn't look like an option.

What I have tried so far:

I trained a gensim model on the Google News corpus (it's about 3.5 GB). I clean each sentence by removing stop words, punctuation marks, etc. Then, for every word, I find which topic in the set of topics it is closest to and tag the word with that topic. The idea is that a sentence probably contains more words close to the topic it refers to than to any other, so I pick the topic(s) to which the maximum number of words in the sentence is mapped.

For example:

If 3 words in a sentence are mapped to the "login" topic and 2 words are mapped to the "improvement" topic, I tag the sentence with the "login" topic.

If there is a tie between the counts of multiple topics, I return all the topics with the maximum count as the topic list.

This approach is giving me fair results, but it's not good enough.
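For reference, here is a minimal sketch of what I am doing; the vectors file name, the topic list and the stop-word list are just placeholders for this example:

```python
from collections import Counter
import string

from gensim.models import KeyedVectors

# placeholder topic list and stop words, for illustration only
TOPICS = ["login", "improvements", "billing"]
STOPWORDS = {"i", "the", "a", "an", "to", "it", "is", "every", "time"}

# pre-trained Google News vectors (file name is a placeholder)
model = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)

def clean(sentence):
    # lowercase, strip punctuation, drop stop words and out-of-vocabulary words
    words = sentence.lower().translate(
        str.maketrans("", "", string.punctuation)
    ).split()
    return [w for w in words if w not in STOPWORDS and w in model]

def tag(sentence):
    # map every word to its most similar topic, then take the majority vote
    counts = Counter()
    for word in clean(sentence):
        counts[max(TOPICS, key=lambda t: model.similarity(word, t))] += 1
    if not counts:
        return []
    best = max(counts.values())
    return [t for t, c in counts.items() if c == best]  # return all topics that tie

print(tag("I keep getting an error message every time I log in"))
```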

What will be the best approach to tackle this problem?

Upvotes: 4

Views: 2442

Answers (2)

Vikas Goyal

Reputation: 1

If the number of topics is manageable, I would suggest that you label some data for each topic and create a supervised model. After that, use multi-class classification to identify topics for the rest of the corpus. You can try something like LUIS.
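If you don't want to use LUIS, a minimal scikit-learn version of the same idea could look like the sketch below; the seed comments and labels are made up for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# a small hand-labelled seed set (hypothetical examples)
texts = [
    "I keep getting an error message every time I log in",
    "password reset link never arrives",
    "make the screen more colorful",
    "please add a dark mode",
]
labels = ["login", "login", "improvements", "improvements"]

# TF-IDF features plus a multi-class logistic regression
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(texts, labels)

# classify the rest of the (unlabelled) corpus
print(clf.predict(["cannot sign in after the latest update"]))
```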

Upvotes: 0

VnC

Reputation: 2026

You would need to clean up the vector space properly (this is one of the most important things for this kind of problem), e.g. remove digits that don't carry meaning, remove gibberish, and experiment with the n-gram range.
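For example, one possible clean-up pass; the exact regexes and the n-gram range below are things to experiment with, not fixed recommendations:

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer

def clean(text):
    text = text.lower()
    text = re.sub(r"\d+", " ", text)       # drop digits that carry no meaning
    text = re.sub(r"[^a-z\s]", " ", text)  # drop punctuation and gibberish characters
    return re.sub(r"\s+", " ", text).strip()

vectorizer = TfidfVectorizer(
    preprocessor=clean,
    ngram_range=(1, 2),  # experiment with uni-grams vs. bi-grams
    min_df=2,            # drop very rare tokens, which are often gibberish
)
```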

Check this article https://medium.com/mlreview/topic-modeling-with-scikit-learn-e80d33668730 for a very good description of LDA and NMF, together with some code snippets that might come in handy.

However, I would tackle the problem in the following way (a rough code sketch follows the list):

  1. Train word2vec or doc2vec (experiment with both), not only on the Google corpus but on your own data as well. FastText skip-grams can prove useful too.
  2. Apply the unsupervised approach (clustering) to get general topics.
  3. Manually label the clusters.
  4. Add another classifier on top of that, which uses the labelled examples as a training set and predicts the topic.
  5. Start classifying your comments, so you are able to use a supervised approach soon enough.
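A rough sketch of steps 1-5, assuming gensim 4.x for Doc2Vec and scikit-learn for the clustering and the classifier on top; the comments, the number of clusters and the cluster-to-topic mapping are placeholders you would decide by inspecting your own data:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

comments = [
    "I keep getting an error message every time I log in",
    "make the screen more colorful",
    "password reset email never arrives",
    "please add a dark mode",
]  # your feedback comments go here

# 1. train doc2vec on your own data (optionally alongside an external corpus)
docs = [TaggedDocument(c.lower().split(), [i]) for i, c in enumerate(comments)]
d2v = Doc2Vec(docs, vector_size=100, min_count=1, epochs=40)
vectors = [d2v.dv[i] for i in range(len(comments))]

# 2. cluster the document vectors into general topics
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)

# 3. manually map cluster ids to topic labels after inspecting some examples
cluster_to_topic = {0: "login", 1: "improvements"}  # decided by eye
labels = [cluster_to_topic[c] for c in kmeans.labels_]

# 4. train a classifier on top of the now-labelled vectors
clf = LogisticRegression().fit(vectors, labels)

# 5. classify a new comment
new_vec = d2v.infer_vector("cannot sign in after the update".lower().split())
print(clf.predict([new_vec]))
```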

However, you may want to label a document with more than one topic, so you shouldn't really tag a sentence with "login" just because 3 words map to "login" and 2 to "improvement" (IMO). Rather, something like a multi-class classification with login at 60% and improvement at 40% seems more sensible.
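With a probabilistic classifier such as the LogisticRegression in the sketch above, those percentages come straight out of predict_proba (the numbers in the comment are illustrative only):

```python
# per-topic probabilities instead of a single hard label
probs = clf.predict_proba([new_vec])[0]
for topic, p in zip(clf.classes_, probs):
    print(f"{topic}: {p:.0%}")
# e.g. improvements: 40%, login: 60%
```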

It sounds like an exciting project you are working on. Good luck!

Upvotes: 1
