user2882900

Text Mining with SVM Classifier

I want to apply SVM classification for a text-mining task using Python NLTK and obtain precision, recall, and accuracy measurements. To do this, I preprocessed my dataset and split it into two text files: pos_file.txt (positive label) and neg_file.txt (negative label). Now I want to apply an SVM classifier with random sampling, using 70% of the data for training and 30% for testing. I have looked at some scikit-learn documentation, but I am not sure exactly how to apply it.

Both pos_file.txt and neg_file.txt can be considered bags of words.

Sample file: pos_file.txt

stackoverflowerror restor default properti page string present
multiprocess invalid assert fetch process inform
folderlevel discoveri option page seen configur scope select project level

Sample file: neg_file.txt

class wizard give error enter class name alreadi exist
unabl make work linux
eclips crash
semant error highlight undeclar variabl doesnt work

Furthermore, it would be interesting to apply the same approach to unigrams, bigrams, and trigrams. Looking forward to your suggestions or sample code.

Upvotes: 3

Views: 7064

Answers (1)

Moses Xu

Reputation: 2160

Below is a rough guideline for applying SVM to text classification:

  1. Convert your texts into vector representations, i.e. numericalize the texts so that SVM (and most other machine learners) can be applied. This can be done quite easily with sklearn.feature_extraction.text.CountVectorizer/TfidfVectorizer, and you can freely select your n-gram range during vectorization, along with all the other options such as stop-word elimination and document-frequency thresholding.
  2. Perform feature selection, which is usually optional since SVMs have been shown to handle feature redundancy well. However, feature selection can shrink the dimensionality of the learning space and speed up training significantly. Common choices are sklearn.feature_selection.chi2 and SelectKBest, to name a few.
  3. Fit (train) an SVC on your training data. Various kernels are at your disposal, and for learner parameters such as C and gamma you can leave the defaults for initial experimentation. If your goal is the best possible performance, use grid search (sklearn.grid_search), which exhaustively tries the parameter combinations you specify and reports the combination that yields the best results. Grid search is usually performed on the evaluation data.
  4. Evaluate. After fine-tuning your learner parameters on the evaluation data, test the fitted SVM's performance on testing data that was unseen during the training and fine-tuning stages. Alternatively, you can use n-fold cross-validation (sklearn.cross_validation) to estimate your SVM's performance. If you have a limited amount of annotated text, n-fold cross-validation is recommended, since it makes use of all the data you have.
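Putting the steps above together, here is a minimal sketch using the sample lines from your question: a TfidfVectorizer with unigram-to-trigram features, a LinearSVC, and a stratified 70/30 random split. Feature selection and grid search are left out for brevity, and the module paths follow current scikit-learn, where the old sklearn.grid_search and sklearn.cross_validation functionality now lives in sklearn.model_selection.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Sample documents from the question; in practice, read one document
# per line from pos_file.txt and neg_file.txt.
pos_texts = [
    "stackoverflowerror restor default properti page string present",
    "multiprocess invalid assert fetch process inform",
    "folderlevel discoveri option page seen configur scope select project level",
]
neg_texts = [
    "class wizard give error enter class name alreadi exist",
    "unabl make work linux",
    "eclips crash",
    "semant error highlight undeclar variabl doesnt work",
]
texts = pos_texts + neg_texts
labels = [1] * len(pos_texts) + [0] * len(neg_texts)  # 1 = positive, 0 = negative

# Step 1 + 3: vectorize (unigrams through trigrams) and fit a linear SVM.
pipe = Pipeline([
    ("vect", TfidfVectorizer(ngram_range=(1, 3))),
    ("clf", LinearSVC()),
])

# Random 70% / 30% split; stratify keeps both labels in each partition.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, random_state=42, stratify=labels)
pipe.fit(X_train, y_train)

# Step 4: precision, recall, and F1 on the held-out 30%.
print(classification_report(y_test, pipe.predict(X_test), zero_division=0))
```

For step 3's tuning, the same pipeline can be dropped into GridSearchCV (searching over, e.g., `vect__ngram_range` and `clf__C`), and for step 4, cross_val_score can replace the single split when the annotated data is scarce.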

The following sklearn documentation page is a really good example of performing text classification in the sklearn framework, and I would recommend it as a starting point:

Classification of text documents using sparse features

Upvotes: 8
