Reputation: 1017
I am trying to do some classification on customer emails.
I am using Python 3 and think I have to use NLTK and scikit-learn. NLTK will help read and understand the text, I believe; scikit-learn will do the classification (happy or sad, and billing or not).
Training data set 1: A few phrases, anywhere from one word to a sentence of 5 to 6 words, labelled 1 for happy and 0 for not happy (a few examples below).
Training data set 2: A few phrases indicating a billing-related question (a few examples below).
Now this seems straightforward from a conceptual standpoint. Where can I find some basic code that will tell me
Upvotes: 0
Views: 1528
Reputation: 1731
Regarding your data sets, your approach is nearly lexicon-based, as the items contain very few words.
For billing, a lexicon-based approach should work well. Give extra weight to the subject lines of the emails.
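As a rough illustration of the billing idea, here is a minimal sketch of a keyword lexicon that counts hits in the subject more heavily than hits in the body. The term list, the weight, the threshold, and the function names are all assumptions made up for this example; you would tune them on your own emails.

```python
import re

# Hypothetical billing lexicon; extend it with terms from your own emails.
BILLING_TERMS = {"invoice", "billing", "bill", "charge", "charged",
                 "refund", "payment", "subscription"}


def count_hits(text):
    """Count how many tokens of the text appear in the billing lexicon."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum(1 for tok in tokens if tok in BILLING_TERMS)


def billing_score(subject, body, subject_weight=2.0):
    """Weighted hit count; subject hits count more than body hits."""
    return subject_weight * count_hits(subject) + count_hits(body)


def is_billing(subject, body, threshold=1.0):
    """Flag the email as billing related if the score reaches the threshold."""
    return billing_score(subject, body) >= threshold


print(is_billing("Question about my invoice", "Hello, can you help?"))
print(is_billing("Hello", "I love your product."))
```

With these assumed settings a single lexicon hit anywhere is enough to flag an email; raising the threshold makes the rule stricter.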
For sentiment analysis you have two options:
Machine learning: In this case you should use a bigger data set (in my view, each item should be a full email). You can implement a Naive Bayes classifier following this tutorial.
Lexicon-based approach: There are several lexicons for sentiment analysis, e.g. SentiWordNet (downloadable via nltk.download()), MPQA, SentiStrength, and WordNet-Affect via WNAffect. Preprocessing: tokenization (nltk.word_tokenize()) and POS tagging (nltk.pos_tag(text)). You should also handle negation (polarity shifting is a good way to deal with it).
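To make the polarity-shifting idea concrete, here is a small sketch in plain Python. It uses a toy lexicon and a regex tokenizer as stand-ins (both are assumptions for illustration); in practice you would swap in nltk.word_tokenize() and a real lexicon such as SentiWordNet. A negation word flips the polarity of the sentiment words that follow it, up to the next punctuation mark.

```python
import re

# Toy polarity lexicon (hypothetical); a real lexicon is far larger.
LEXICON = {"happy": 1, "good": 1, "great": 1,
           "sad": -1, "bad": -1, "terrible": -1}
NEGATORS = {"not", "no", "never"}
PUNCTUATION = set(".,!?;")


def sentiment_score(text):
    """Sum word polarities, flipping the sign inside a negation scope."""
    tokens = re.findall(r"[a-z']+|[.,!?;]", text.lower())
    score = 0
    negated = False
    for tok in tokens:
        if tok in PUNCTUATION:
            negated = False  # punctuation ends the negation scope
        elif tok in NEGATORS or tok.endswith("n't"):
            negated = True   # start shifting polarity
        elif tok in LEXICON:
            polarity = LEXICON[tok]
            score += -polarity if negated else polarity
    return score


print(sentiment_score("I am happy"))
print(sentiment_score("I am not happy"))
print(sentiment_score("Not bad, great service"))
```

A positive score suggests a happy message and a negative score an unhappy one. The "until the next punctuation mark" scope is only one simple convention for negation; a fixed window of a few words is another common choice.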
Machine learning generally gives the best results, so if you have enough annotated emails it is the better choice.
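For the machine learning route, a minimal sketch with scikit-learn could look like the following. It assumes scikit-learn is installed, and the six training sentences are invented placeholders, not real data; as noted above, you would want many full annotated emails instead. It pairs a bag-of-words CountVectorizer with a MultinomialNB classifier in a pipeline.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Placeholder training data (1 = happy, 0 = not happy); replace with real emails.
train_texts = [
    "thank you great service",
    "love the fast support",
    "very happy with the product",
    "terrible service very disappointed",
    "angry about the slow response",
    "worst support ever",
]
train_labels = [1, 1, 1, 0, 0, 0]

# Bag-of-words features feeding a multinomial Naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

predictions = model.predict(["great support thank you", "terrible slow response"])
print(list(predictions))
```

The same pipeline works for the billing question: train a second model on emails labelled billing or not billing. With so little data the predictions are only illustrative; hold out part of your annotated emails to measure accuracy before relying on it.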
Upvotes: 3