Mayur Kulkarni

Reputation: 1316

How to increase the speed for SVM classifier using Sk-learn

I'm trying to build a spam mail classifier. I've collected several datasets from the internet (e.g. the SpamAssassin database of spam/ham mails) and built this:

import os
import numpy
from pandas import DataFrame, concat
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import KFold
from sklearn.metrics import confusion_matrix, f1_score
from sklearn import svm

NEWLINE = '\n'

HAM = 'ham'
SPAM = 'spam'

SOURCES = [
    ('C:/data/spam', SPAM),
    ('C:/data/easy_ham', HAM),
    # The entries below are commented out because training on them takes too long:
    # ('C:/data/hard_ham', HAM),
    # ('C:/data/beck-s', HAM),
    # ('C:/data/farmer-d', HAM),
    # ('C:/data/kaminski-v', HAM),
    # ('C:/data/kitchen-l', HAM),
    # ('C:/data/lokay-m', HAM),
    # ('C:/data/williams-w3', HAM),
    # ('C:/data/BG', SPAM),
    # ('C:/data/GP', SPAM),
    # ('C:/data/SH', SPAM)
]

SKIP_FILES = {'cmds'}


def read_files(path):
    # os.walk already descends into subdirectories, so no explicit recursion is needed.
    for root, dir_names, file_names in os.walk(path):
        for file_name in file_names:
            if file_name not in SKIP_FILES:
                file_path = os.path.join(root, file_name)
                if os.path.isfile(file_path):
                    past_header, lines = False, []
                    with open(file_path, encoding="latin-1") as f:
                        for line in f:
                            if past_header:
                                lines.append(line)
                            elif line == NEWLINE:
                                # The first blank line separates the header from the body.
                                past_header = True
                    content = ''.join(lines)  # lines already end with '\n'
                    yield file_path, content


def build_data_frame(path, classification):
    rows = []
    index = []
    for file_name, text in read_files(path):
        rows.append({'text': text, 'class': classification})
        index.append(file_name)

    data_frame = DataFrame(rows, index=index)
    return data_frame


frames = [build_data_frame(path, classification) for path, classification in SOURCES]
data = concat(frames)

data = data.reindex(numpy.random.permutation(data.index))

pipeline = Pipeline([
    ('count_vectorizer', CountVectorizer(ngram_range=(1, 2))),
    ('classifier', svm.SVC(gamma=0.001, C=100))
])

k_fold = KFold(n_splits=6)
scores = []
confusion = numpy.array([[0, 0], [0, 0]])
for train_indices, test_indices in k_fold.split(data):
    train_text = data.iloc[train_indices]['text'].values
    train_y = data.iloc[train_indices]['class'].values.astype(str)

    test_text = data.iloc[test_indices]['text'].values
    test_y = data.iloc[test_indices]['class'].values.astype(str)

    pipeline.fit(train_text, train_y)
    predictions = pipeline.predict(test_text)

    confusion += confusion_matrix(test_y, predictions)
    score = f1_score(test_y, predictions, pos_label=SPAM)
    scores.append(score)

print('Total emails classified:', len(data))
print('Support Vector Machine Output : ')
print('Score:' + str((sum(scores) / len(scores))*100) + '%')
print('Confusion matrix:')
print(confusion)

The lines I've commented out are additional mail collections. Even when I comment out most of the datasets and keep only the one with the fewest mails, training still runs extremely slowly (~15 minutes) and gives an accuracy of about 91%. How do I improve the speed and accuracy?

Upvotes: 3

Views: 4470

Answers (1)

David Maust

Reputation: 8270

You are using kernel SVM. There are two problems with this.

Running Time Complexity of Kernel SVM: The first step in training a kernel SVM is building a similarity matrix, which becomes the feature set. With 30,000 documents, the similarity matrix has 900,000,000 elements, and it grows as the square of the number of documents in your corpus. This could be worked around using RBFSampler in scikit-learn, but you probably don't want to use that, for the next reason.
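
For reference, a minimal sketch of what the RBFSampler route could look like, reusing the vectorizer from the question; the pipeline name and the gamma / n_components values are only placeholders, not tuned settings:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Map the bag-of-words counts through an approximate RBF feature space,
# then train a linear model on top; training now scales with the number
# of documents rather than its square.
approx_pipeline = Pipeline([
    ('count_vectorizer', CountVectorizer(ngram_range=(1, 2))),
    ('rbf_features', RBFSampler(gamma=0.001, n_components=500, random_state=42)),
    ('classifier', SGDClassifier(max_iter=1000))
])

# approx_pipeline.fit(train_text, train_y)   # train_text / train_y as in the question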

Dimensionality: You are using term and bigram counts as your feature set, which is an extremely high-dimensional dataset. With an RBF kernel in such a high-dimensional space, even small differences (noise) can have a large impact on the similarity results. See the curse of dimensionality. This is likely why your RBF kernel yields worse results than a linear kernel.
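
As a point of comparison, a linear-kernel version of the question's pipeline could look like the sketch below. LinearSVC is scikit-learn's dedicated linear SVM and scales far better on high-dimensional, sparse text features; the C value here is only a placeholder:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

linear_pipeline = Pipeline([
    ('count_vectorizer', CountVectorizer(ngram_range=(1, 2))),
    ('classifier', LinearSVC(C=1.0))   # no gamma for a linear kernel; tune C instead
])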

Stochastic Gradient Descent: SGD can be used instead of the standard SVM, and with good parameter tuning it may yield similar or possibly even better results. The drawback is that SGD has more parameters to tune, namely the learning rate and the learning rate schedule. SGD is also not ideal when you can only make a few passes over the data; in that case, algorithms such as Follow The Regularized Leader (FTRL) do better, but scikit-learn does not implement FTRL. Using SGDClassifier with loss="modified_huber" often works well.
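
A rough sketch of swapping the classifier for SGDClassifier with the suggested loss; the alpha and max_iter values are placeholders that would need tuning:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

sgd_pipeline = Pipeline([
    ('count_vectorizer', CountVectorizer(ngram_range=(1, 2))),
    ('classifier', SGDClassifier(loss='modified_huber',
                                 alpha=1e-4,        # regularization strength, tune this
                                 max_iter=1000,
                                 random_state=42))
])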

Now that we have the problems out of the way, there are several ways you can improve performance:

tf-idf weights: With tf-idf, more common words are weighted less, which lets the classifier give more weight to the rarer words that carry more meaning. This can be implemented by switching CountVectorizer to TfidfVectorizer.
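
In the question's pipeline this is essentially a one-line swap; the sketch below assumes a linear SVM as discussed above:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

tfidf_pipeline = Pipeline([
    ('tfidf_vectorizer', TfidfVectorizer(ngram_range=(1, 2))),   # was CountVectorizer
    ('classifier', LinearSVC())
])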

Parameter tuning: With a linear SVM there is no gamma parameter, but tuning the C parameter can greatly improve results. For SGDClassifier, the alpha and learning rate parameters can be tuned as well.
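
One way to tune C (or alpha) is a small grid search over the whole pipeline; the grid values and the f1_macro scoring choice below are only illustrative:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

grid_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1, 2))),
    ('clf', LinearSVC())
])

param_grid = {'clf__C': [0.01, 0.1, 1, 10, 100]}   # illustrative grid

search = GridSearchCV(grid_pipeline, param_grid, scoring='f1_macro', cv=5)
# search.fit(train_text, train_y)                  # data as in the question
# print(search.best_params_, search.best_score_)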

ensembling: Running your model on multiple subsamples and averaging the results will often produce a more robust model than a single run. This can be done in scikit-learn using the BaggingClassifier. Combining different approaches can also produce significantly better results. If substantially different approaches are used, consider using a stacked model with a tree model (RandomForestClassifier or GradientBoostingClassifier) as the last stage.
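
A minimal sketch of bagging several LinearSVC models with BaggingClassifier; the number of estimators and the subsample fraction are placeholders:

from sklearn.ensemble import BaggingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

bagged_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1, 2))),
    ('ensemble', BaggingClassifier(LinearSVC(),      # base estimator, passed positionally
                                   n_estimators=10,  # placeholder
                                   max_samples=0.8,  # train each copy on 80% of the data
                                   random_state=42))
])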

Upvotes: 3
