Reputation: 3062
I am doing text classification in Python and I want to use the model in a production environment to make predictions on new documents. I am using TfidfVectorizer to build the bag-of-words representation.
I am doing:
X_train = vectorizer.fit_transform(clean_documents_for_train).toarray()
Then I do cross-validation and build the model using SVM. After that I save the model.
To make predictions on my test data, I load that model in another script, where I have the same TfidfVectorizer. I know I can't call fit_transform on my testing data; I have to do:
X_test = vectorizer.transform(clean_test_documents).toarray()
But this is not possible, because the vectorizer has to be fitted first. I know I could load my training data and call fit_transform like I did while building the model, but my training data is very large, and I can't afford to do that every time I want to predict. So my question is: how can I use the saved model and the fitted vectorizer to predict on new documents without refitting on the training data?
Upvotes: 2
Views: 5931
Reputation: 152
You can simply use clf.predict together with .apply and a lambda:
datad['Predictions'] = datad['InputX'].apply(lambda x: str(clf.predict(count_vect.transform([x]))))
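For context, here is a minimal self-contained sketch of this approach; the training data, `count_vect`, and `clf` are hypothetical stand-ins for the already-fitted objects the snippet above assumes:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical training data standing in for the fitted objects above
train = pd.DataFrame({
    'text': ['good movie', 'bad film', 'great movie', 'awful film'],
    'label': [1, 0, 1, 0],
})
count_vect = CountVectorizer().fit(train['text'])
clf = MultinomialNB().fit(count_vect.transform(train['text']), train['label'])

# Predict row by row with apply + lambda, as in the answer
datad = pd.DataFrame({'InputX': ['great movie', 'awful film']})
datad['Predictions'] = datad['InputX'].apply(
    lambda x: clf.predict(count_vect.transform([x]))[0])
print(datad)
```

Note that transforming one row at a time is inefficient; `clf.predict(count_vect.transform(datad['InputX']))` vectorizes the whole column in a single call.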
Upvotes: 0
Reputation: 3301
I was redirected here by the search "How to use saved model for prediction?", so, just to add the final step to YS-L's answer:
Saving the model
from sklearn.externals import joblib
joblib.dump(fittedModel, 'name.model')
Load the saved model and predict
fittedModel = joblib.load('name.model')
fittedModel.predict(X_new) # X_new is unseen example to be predicted
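As a side note, `sklearn.externals.joblib` was deprecated and later removed from scikit-learn; on recent versions, import the standalone `joblib` package directly. A minimal sketch with a hypothetical fitted model:

```python
import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical fitted model standing in for fittedModel above
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
fittedModel = LogisticRegression().fit(X, y)

# Save, then load and predict, as in the answer
joblib.dump(fittedModel, 'name.model')
loaded = joblib.load('name.model')
print(loaded.predict(np.array([[2.5]])))  # X_new: unseen example to be predicted
```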
Upvotes: 3
Reputation: 14748
The vectorizer is part of your model. When you save your trained SVM model, you need to also save the corresponding vectorizer.
To make this more convenient, you can use Pipeline to construct a single "fittable" object that represents the steps needed to transform raw input to prediction output. In this case, the pipeline consists of a Tf-Idf extractor and an SVM classifier:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import svm
from sklearn.pipeline import Pipeline
vectorizer = TfidfVectorizer()
clf = svm.SVC()
tfidf_svm = Pipeline([('tfidf', vectorizer), ('svc', clf)])
documents, y = load_training_data()
tfidf_svm.fit(documents, y)
This way, only a single object needs to be persisted:
from sklearn.externals import joblib
joblib.dump(tfidf_svm, 'model.pkl')
To apply the model to your testing documents, load the trained pipeline and simply use its predict function as usual, with raw document(s) as input.
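For example, an end-to-end sketch of the save/load/predict cycle (the documents and labels here are made up for illustration, and `joblib` is imported directly, as required on recent scikit-learn versions):

```python
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import svm
from sklearn.pipeline import Pipeline

# Hypothetical training corpus
docs = ['spam spam offer', 'hello friend', 'free offer now', 'meeting tomorrow']
y = [1, 0, 1, 0]

# Fit the whole pipeline on raw documents and persist it as one object
tfidf_svm = Pipeline([('tfidf', TfidfVectorizer()), ('svc', svm.SVC())])
tfidf_svm.fit(docs, y)
joblib.dump(tfidf_svm, 'model.pkl')

# Later, in the prediction script: load and predict on raw documents
pipeline = joblib.load('model.pkl')
preds = pipeline.predict(['free spam offer', 'see you at the meeting'])
print(preds)
```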
Upvotes: 5