Reputation: 178

Machine learning models persistence options

Are there any suggestions or best practices for persisting and reusing trained machine learning models? I develop models in Python or R, and they must then be used for scoring in a production workflow where R is not available. For example, a logistic regression model might be trained in R, and new observations then need to be scored against it. The scoring engine must be fast and scalable. I have considered the following options:

1. PMML (http://en.wikipedia.org/wiki/Predictive_Model_Markup_Language). It is easy to convert most models developed in R to PMML. However, I couldn't find a useful scoring engine for PMML models. For example, there is Augustus (https://code.google.com/p/augustus/), but so far it implements only 3-4 models.
2. Serialize the models using pickle in Python and write the consumer in Python.

Any thoughts or suggestions on the right approach?

Upvotes: 5

Answers (3)

Amarpreet Singh

Reputation: 277

You can use MessagePack. It stores the model in a JSON-like format, is fast, and takes less memory. See https://github.com/muayyad-alsadi/sklearn_msgpack

Upvotes: 0

Shabaz Patel

Reputation: 291

You can save and load the model using pickle in Python as follows:

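import pickle
# clf is assumed to be an already fitted estimator
s = pickle.dumps(clf)    # serialize the model to a bytes object
clf2 = pickle.loads(s)   # restore an equivalent model from those bytes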

Another option is joblib, which is more efficient on objects that carry large NumPy arrays internally, as is often the case for fitted scikit-learn estimators.

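import joblib  # standalone package; sklearn.externals.joblib was removed in scikit-learn 0.23
joblib.dump(clf, 'filename.pkl')   # write the fitted model to disk
clf = joblib.load('filename.pkl')  # load it back, e.g. in the scoring process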

The loaded model can then be deployed in production behind a RESTful API.
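As a rough sketch of what such a scoring endpoint could look like, here is a minimal version using only the Python standard library. Everything named here is an assumption for illustration: `ThresholdModel` is a hypothetical stand-in for the estimator you would load with pickle or joblib, and the `instances`/`predictions` JSON keys and port 8000 are arbitrary choices, not a standard.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class ThresholdModel:
    """Hypothetical stand-in for a fitted estimator with a predict() method."""

    def predict(self, rows):
        # Toy rule: class 1 when the feature sum exceeds 1.0, else class 0.
        return [1 if sum(row) > 1.0 else 0 for row in rows]


# In a real service this would be something like joblib.load('filename.pkl'),
# executed once at startup rather than once per request.
MODEL = ThresholdModel()


def score_payload(body, model=MODEL):
    """Turn a JSON request body into a JSON response body (bytes)."""
    rows = json.loads(body)["instances"]
    return json.dumps({"predictions": model.predict(rows)}).encode("utf-8")


class ScoringHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            response = score_payload(self.rfile.read(length))
            status = 200
        except (ValueError, KeyError, TypeError) as exc:
            # Malformed JSON or a missing "instances" key is a client error.
            response = json.dumps({"error": str(exc)}).encode("utf-8")
            status = 400
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(response)))
        self.end_headers()
        self.wfile.write(response)


def serve(port=8000):
    """Start the server; this call blocks until the process is stopped."""
    HTTPServer(("0.0.0.0", port), ScoringHandler).serve_forever()
```

Calling `serve()` starts the endpoint, and a POST with a body such as `{"instances": [[0.2, 0.3]]}` returns the predictions as JSON. Keeping the scoring logic in `score_payload` separate from the HTTP handler makes it easy to test without a running server, and to move to a production-grade web framework later.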

Upvotes: 0

logc

Reputation: 3923

Scikit-learn, a mature library in this field, uses pickle for model persistence. I gather you are writing your own functions to train the model, but looking at the established libraries can tell you about best practices.

On the other hand, JSON can be read from many languages, which is its main advantage. If you plan to serve model results from another language, and your models are fairly simple Python objects, then serializing them to JSON should be pretty easy.
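For a model as simple as logistic regression, the JSON route could look roughly like the sketch below. The feature names and parameter values are made up for illustration; in practice you would copy them from the fitted model (for example `coef_` and `intercept_` in scikit-learn, or `coef(fit)` in R). The JSON layout is also just one possible choice, not an established format.

```python
import json
import math

# Hypothetical parameters of an already fitted logistic regression.
model = {
    "type": "logistic_regression",
    "features": ["age", "income"],
    "coefficients": [0.8, -0.5],
    "intercept": 0.1,
}


def model_to_json(params):
    """Serialize the model parameters to a JSON string."""
    return json.dumps(params)


def model_from_json(text):
    """Restore the model parameters from a JSON string."""
    return json.loads(text)


def score(params, observation):
    """Return P(y = 1) for one observation given as a {feature: value} dict."""
    z = params["intercept"] + sum(
        coef * observation[name]
        for name, coef in zip(params["features"], params["coefficients"])
    )
    # Numerically stable sigmoid: avoids overflow in exp() for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


restored = model_from_json(model_to_json(model))
probability = score(restored, {"age": 1.0, "income": 2.0})
```

The scoring side needs only the coefficients and a sigmoid, so the same few lines are easy to port to whatever language the production system uses.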

Upvotes: 2
