Rudy

Reputation: 446

Retraining a Persistent SVM Model with New Data in Scikit-Learn (Python 3)

I'm working on a machine-learning program in Python using Scikit-Learn that sorts emails into issue-type categories based on their contents. For example, if someone emails me saying "This program is not launching", the machine should categorize it as "Crash Issue".

I'm using an SVM algorithm that reads the email contents and their respective category labels from two CSV files. I've written two programs:

  1. The first program trains the machine and exports the trained model using joblib.dump() so that the second program can use it.
  2. The second program imports the trained model and makes predictions. It asks the user to type in an email, predicts a category, and then asks the user whether the prediction was correct. In both cases, I'd like the machine to learn from the outcome, so I want this program to be able to update the trained model by re-fitting the classifier with the new data. I'm not sure how to accomplish this.

Training Program:

import numpy as np
import pandas as pd
from pandas import DataFrame
import os
from sklearn import svm
from sklearn import preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23; use the standalone joblib package


###### Extract and Vectorize the features from each email in the Training Data ######
features_file = "features.csv" #The CSV file that contains the descriptions of each email. Features will be extracted from this text data
features_df = pd.read_csv(features_file, encoding='ISO-8859-1') 
vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(features_df['Description'].values.astype('U')) #The sole column in the CSV file is labeled "Description", so we specify that here


###### Encode the class Labels of the Training Data ######
labels_file = "labels.csv" #The CSV file that contains the classification labels for each email
labels_df = pd.read_csv(labels_file, encoding='ISO-8859-1')
lab_enc = preprocessing.LabelEncoder()
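labels = lab_enc.fit_transform(labels_df.iloc[:, 0]) #LabelEncoder expects a 1-D array, so pass the single label column rather than the whole DataFrame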


###### Create a classifier and fit it to our Training Data ######
clf = svm.SVC(gamma=0.01, C=100)
clf.fit(features, labels)


###### Output persistent model files ######
joblib.dump(clf, 'brain.pkl')
joblib.dump(vectorizer, 'vectorizer.pkl')
joblib.dump(lab_enc, 'lab_enc.pkl')
print("Training completed.")

Prediction Program:

import numpy as np
import os
from sklearn import svm
from sklearn import preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23; use the standalone joblib package


###### Load our model from our training program ######
clf = joblib.load('brain.pkl')
vectorizer = joblib.load('vectorizer.pkl')
lab_enc = joblib.load('lab_enc.pkl')


###### Prompt user for input, then make a prediction ######
print("Type an email's contents here and I will predict its category")
newData = [input(">> ")]
newDataFeatures = vectorizer.transform(newData)
print("I predict the category is: ", lab_enc.inverse_transform(clf.predict(newDataFeatures)))


###### Feedback loop - Tell the machine whether or not it was correct, and have it learn from the response ######
print("Was my prediction correct? y/n")
feedback = input(">> ")

#Keep asking until the answer is exactly "y" or "n"
while feedback not in ("y", "n"):
    print("Response not understood. Was my prediction correct? y/n")
    feedback = input(">> ")

if feedback == "y":
    print("I was correct. I'll incorporate this new data into my persistent model to aid in future predictions.")
    #refit the classifier using the new features and label
elif feedback == "n":
    print("I was incorrect. What was the correct category?")
    correctAnswer = input(">> ")
    print("Got it. I'll incorporate this new data into my persistent model to aid in future predictions.")
    #refit the classifier using the new features and label

From the reading I've done, I gather that SVM doesn't really support incremental learning, so I figure I need to add the new data to the old training data and retrain the entire model from scratch every time new data comes in. That's fine, but I'm not sure how to actually implement it. Would the Prediction Program need to update the two CSV files to include the new data so that training can start over?

Upvotes: 0

Views: 2307

Answers (1)

Rudy

Reputation: 446

I ended up figuring out that the conceptual answer to my question was that I needed to update the CSV files I was initially using to train the machine. After receiving feedback, I simply write the new features and labels out to their respective CSV files, and I can then retrain the machine with that new information included in the training data set.
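For anyone wanting something concrete, here is a minimal sketch of the "write the feedback back to the CSV files" step, using only the standard library. The helper names `append_feedback` and `_append_row` are my own and purely illustrative. It assumes each CSV has a single column, as in the question, and that both files already exist with their header rows.

```python
import csv
import os


def _append_row(path, value, encoding="ISO-8859-1"):
    """Append one single-column row to the CSV file at `path` (hypothetical helper)."""
    # If the file does not end with a line break, add one first so the new
    # row does not get glued onto the last existing row.
    needs_newline = False
    if os.path.exists(path) and os.path.getsize(path) > 0:
        with open(path, "rb") as f:
            f.seek(-1, os.SEEK_END)
            needs_newline = f.read(1) not in (b"\n", b"\r")

    # newline="" lets the csv module control line endings; errors="replace"
    # substitutes "?" for characters that ISO-8859-1 cannot represent.
    with open(path, "a", newline="", encoding=encoding, errors="replace") as f:
        if needs_newline:
            f.write("\n")
        csv.writer(f, lineterminator="\n").writerow([value])


def append_feedback(description, label,
                    features_path="features.csv", labels_path="labels.csv"):
    """Record one confirmed (email text, category) pair in both CSV files."""
    # One row goes into each file so that row i of the features file still
    # lines up with row i of the labels file.
    _append_row(features_path, description)
    _append_row(labels_path, label)
```

In the Prediction Program, the "y" branch would call `append_feedback(newData[0], predicted_label)`, where `predicted_label` is the string taken from `lab_enc.inverse_transform(clf.predict(newDataFeatures))[0]`, and the "n" branch would call `append_feedback(newData[0], correctAnswer)`. Re-running the Training Program afterwards rebuilds `brain.pkl`, `vectorizer.pkl` and `lab_enc.pkl` from the enlarged data set. Note that a corrected category typed with a different spelling will be treated as a brand-new class on the next training run.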

Upvotes: 0
