Reputation: 1661
I have a data set of tweets containing keywords related to vaccine perception, such as [jab, shot, measles, MMR, vaccine, autism, ...].
I would like to be able to classify a new tweet as either Pro-vaccine, anti-vaccine, or neither. I understand Naive Bayes is one way to do this.
I would rather use scikit-learn to implement the classification algorithm, since its implementations are more robust than anything I could write myself.
How can I implement Naive Bayes? From scikit-learn's website, it seems my choices are multinomial and Gaussian, but I'm not sure which to use.
Upvotes: 0
Views: 1130
Reputation: 663
Below is a simple implementation of a classifier that assigns a piece of text to one of five diseases.
It uses two files:
Train file (train.txt)
Test file (test.txt)
For your question, put your labeled tweets in the Train file (one tweet per line) and the tweets you want to classify in the Test file.
[Note: You can also load your data set from CSV or JSON; I have used a text file for the sake of simplicity.]
Content of Train file: [ train.txt ]
A highly contagious virus spread by coughing, sneezing or direct contact with skin lesions.
A contagious liver disease often caused by consuming contaminated food or water. It is the most common vaccine-preventable travel disease.
A serious liver disease spread through contact with blood or body fluids. The hepatitis B virus can cause liver cancer and possible death.
A group of over 100 viruses that spreads through sexual contact. HPV strains may cause genital warts and lead to cervical cancer.
A potentially fatal bacterial infection that strikes an average of 1,500 Americans annually.
Content of Test file: [ test.txt ]
died due to liver cancer.
Classification code: [ classifier.py ]
import codecs
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

trainfile = 'train.txt'
testfile = 'test.txt'

# One label per line of train.txt, in the same order as the lines.
tags = ['CHICKEN POX', 'HEPATITIS A', 'HEPATITIS B', 'Human papillomavirus', 'MENINGITIS']

# Build bag-of-words counts; each line of the train file is one document.
word_vectorizer = CountVectorizer(analyzer='word')
with codecs.open(trainfile, 'r', 'utf8') as f:
    trainset = word_vectorizer.fit_transform(f)

# Train the multinomial Naive Bayes model on the word counts.
mnb = MultinomialNB()
mnb.fit(trainset, tags)

# Vectorize the test lines with the vocabulary learned from the train file.
with codecs.open(testfile, 'r', 'utf8') as f:
    testset = word_vectorizer.transform(f)

results = mnb.predict(testset)
print(results)
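On the multinomial versus Gaussian choice: MultinomialNB is intended for discrete features such as word counts, which is what CountVectorizer produces, while GaussianNB models each feature as a continuous, normally distributed value. For bag-of-words text, multinomial is therefore the usual choice. The same approach could be adapted to the three-class tweet problem roughly as sketched here. The tweets and the labels "pro", "anti", and "neither" are made up for illustration and are not from your data set; a real classifier would need far more labeled examples.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical hand-labeled tweets, invented for illustration only.
train_tweets = [
    "Got my MMR shot today, vaccines save lives",
    "Measles is back because people skip the vaccine, get your jab",
    "The MMR vaccine causes autism, do not trust it",
    "No jab for my kids, vaccines are poison",
    "Great weather for a run this morning",
    "Took a shot at the goal and missed",
]
train_labels = ["pro", "pro", "anti", "anti", "neither", "neither"]

# Chain the vectorizer and the classifier so both are fitted together
# and new tweets are transformed with the same vocabulary.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_tweets, train_labels)

new_tweets = ["Booked the measles jab for my daughter"]
predicted = model.predict(new_tweets)
probabilities = model.predict_proba(new_tweets)

# Class order of the probability columns comes from the fitted classifier.
classes = model.named_steps["multinomialnb"].classes_

print(predicted)
print(dict(zip(classes, probabilities[0])))
```

With so few training tweets the predicted label is not reliable; the point is only the shape of the workflow. Looking at predict_proba alongside predict can help you spot tweets the model is unsure about.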
Upvotes: 1