Asad

Reputation: 3062

How to use a saved model for prediction in Python

I am doing text classification in Python and want to use the model in a production environment to make predictions on new documents. I am using TfidfVectorizer to build the bag-of-words representation.

I am doing:

X_train = vectorizer.fit_transform(clean_documents_for_train, classLabel).toarray()

I then run cross-validation, build the model using an SVM, and save the model.

To make predictions on my test data, I load that model in another script that has the same TfidfVectorizer. I know I can't call fit_transform on my test data; I have to do:

X_test = vectorizer.transform(clean_test_documents, classLabel).toarray()

But this is not possible, because the vectorizer has to be fitted first. I know one way around it: I can load my training data and call fit_transform as I did when building the model. However, my training data is very large, and I can't do that every time I want to predict. So my question is:

Upvotes: 2

Views: 5931

Answers (3)

mkpisk

Reputation: 152

You can call clf.predict on each row of a DataFrame column using .apply with a lambda:

datad['Predictions'] = datad['InputX'].apply(lambda x: str(clf.predict(count_vect.transform([x]))))

Upvotes: 0

mythicalcoder

Reputation: 3301

I landed here after searching for "How to use saved model for prediction?". So, to add to YS-L's answer, here is the final step.

Saving the model

# sklearn.externals.joblib was removed from recent scikit-learn releases; import joblib directly
import joblib
joblib.dump(fittedModel, 'name.model')

Load the saved model and predict

fittedModel = joblib.load('name.model')
fittedModel.predict(X_new)  # X_new holds the unseen examples to be predicted

Upvotes: 3

YS-L

Reputation: 14748

The vectorizer is part of your model. When you save your trained SVM model, you need to also save the corresponding vectorizer.
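As a minimal sketch of that idea, the fitted vectorizer and the fitted classifier can be dumped side by side and reloaded together at prediction time. The toy corpus, the labels, the linear kernel, and the file names below are all assumptions made for illustration; they do not come from the question.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Hypothetical toy corpus standing in for the real training documents.
train_docs = [
    "cheap pills buy now",
    "meeting agenda for monday",
    "buy cheap watches now",
    "project meeting notes attached",
]
train_labels = ["spam", "ham", "spam", "ham"]

# Training script: fit the vectorizer and the classifier, then save both.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)
clf = SVC(kernel="linear").fit(X_train, train_labels)

model_dir = tempfile.mkdtemp()  # stand-in for a real model directory
vec_path = os.path.join(model_dir, "vectorizer.pkl")
clf_path = os.path.join(model_dir, "svm.pkl")
joblib.dump(vectorizer, vec_path)
joblib.dump(clf, clf_path)

# Prediction script: load both objects and call transform, not fit_transform.
loaded_vectorizer = joblib.load(vec_path)
loaded_clf = joblib.load(clf_path)
X_new = loaded_vectorizer.transform(["buy cheap pills"])
prediction = loaded_clf.predict(X_new)
```

Because the loaded vectorizer carries the vocabulary and IDF weights learned during training, X_new has the same number of columns as X_train, and the training data never has to be reloaded.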

To make this more convenient, you can use Pipeline to construct a single "fittable" object that represents all the steps needed to turn raw input into a prediction. In this case, the pipeline consists of a TF-IDF extractor and an SVM classifier:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import svm
from sklearn.pipeline import Pipeline

vectorizer = TfidfVectorizer()
clf = svm.SVC()
tfidf_svm = Pipeline([('tfidf', vectorizer), ('svc', clf)])

# load_training_data() is a placeholder for your own data-loading code
documents, y = load_training_data()
tfidf_svm.fit(documents, y)

This way, only a single object needs to be persisted:

# sklearn.externals.joblib was removed from recent scikit-learn releases; import joblib directly
import joblib
joblib.dump(tfidf_svm, 'model.pkl')

To apply the model to your test documents, load the trained pipeline and call its predict method as usual, passing the raw document(s) as input.
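A rough end-to-end sketch of that last step is shown below. The toy documents, labels, linear kernel, and temporary file path are assumptions for illustration only; in practice the training and prediction halves would live in separate scripts.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical toy data standing in for load_training_data().
documents = [
    "cheap pills buy now",
    "meeting agenda for monday",
    "buy cheap watches now",
    "project meeting notes attached",
]
y = ["spam", "ham", "spam", "ham"]

# Training side: fit the whole pipeline and persist it as one object.
tfidf_svm = Pipeline([("tfidf", TfidfVectorizer()), ("svc", SVC(kernel="linear"))])
tfidf_svm.fit(documents, y)
model_path = os.path.join(tempfile.mkdtemp(), "model.pkl")
joblib.dump(tfidf_svm, model_path)

# Prediction side: load the pipeline and pass raw strings directly.
loaded_pipeline = joblib.load(model_path)
new_documents = ["cheap watches", "notes from the meeting"]
predictions = loaded_pipeline.predict(new_documents)
print(list(zip(new_documents, predictions)))
```

The loaded pipeline applies the stored TF-IDF transform internally, so no separate vectorizer call is needed before predict.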

Upvotes: 5
