Reputation: 3198

Means to save a Python kmodes clustering model to disk?

Background

I am currently using the kmodes python package to perform unsupervised learning on data that includes categorical parameters.

I need to be able to save these models, as I am planning to use them in a production pipeline where I wish to be able to "roll back" to older, working models if something in the pipeline fails.

Requirements

I can use any file format, including HDF5. I am also not wedded to kmodes; however, I do need to be able to handle mixed categorical and numerical data.

Help

I cannot seem to find any way that I can save the full kmodes model to disk, but I'm hoping that I'm just missing something obvious. Please provide any potential options.

Upvotes: 0

Answers (3)

cacti5

Reputation: 2106

You are looking for the Python pickle library.

The pickle module implements an algorithm for turning an arbitrary Python object into a series of bytes. This process is also called "serializing" the object. The byte stream representing the object can then be transmitted or stored, and later reconstructed to create a new object with the same characteristics.

I think this would be a very helpful resource for you in implementing it.
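As a minimal sketch of that round trip, the snippet below serializes an object to bytes and rebuilds it. The dictionary is only a stand-in for a fitted model (its keys are made up for illustration); any picklable object should behave the same way.

```python
import pickle

# Stand-in for a fitted model (hypothetical contents, for illustration only).
model = {"n_clusters": 4, "centroids": [[1, 0, 3], [2, 2, 1]]}

# Serialize the object to a byte string, then rebuild it from those bytes.
payload = pickle.dumps(model, protocol=pickle.HIGHEST_PROTOCOL)
restored = pickle.loads(payload)

# The restored object compares equal to the original but is a separate copy.
print(isinstance(payload, bytes))
print(restored == model, restored is model)
```

Only unpickle data from sources you trust, since loading a pickle can execute arbitrary code.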

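If you are on Python 2, another library to look into is cPickle. (In Python 3 there is no separate cPickle module; the standard pickle module uses its C implementation automatically when it is available.) Why?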

First, cPickle can be up to 1000 times faster than pickle because the former is implemented in C.

Given that you need to save your models to disk, your model is probably fairly large. Time is a priority, and this will save you a lot of it.

Second, in the cPickle module the callables Pickler() and Unpickler() are functions, not classes. This means that you cannot use them to derive custom pickling and unpickling subclasses. Most applications have no need for this functionality and should benefit from the greatly improved performance of the cPickle module.

So it depends on your program and required functionality. A good example of using cPickle can be found here

Upvotes: 1

chthonicdaemon

Reputation: 19770

Let's start with the example clustering from the project's README:

import numpy as np
from kmodes.kmodes import KModes

# random categorical data
data = np.random.choice(20, (100, 10))

km = KModes(n_clusters=4, init='Huang', n_init=5, verbose=1)

clusters = km.fit_predict(data)

We can now save this using the pickle module:

import pickle

# It is important to use binary access
with open('km.pickle', 'wb') as f:
    pickle.dump(km, f)

To read back the object, use

with open('km.pickle', 'rb') as f:
    km = pickle.load(f)
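For the "roll back" requirement in the question, one possible approach is to write each model to its own numbered file instead of overwriting a single one. The sketch below uses only the standard library. The helper names save_model and load_model, the file naming scheme, and the dictionaries standing in for fitted models are all assumptions made for illustration, and the numbering assumes that old files are never deleted from the directory.

```python
import pickle
import tempfile
from pathlib import Path


def save_model(model, directory, prefix="km"):
    """Pickle `model` to the next numbered file in `directory`; return its path."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    # Next version = number of existing files + 1 (assumes none were deleted).
    version = len(list(directory.glob(f"{prefix}-*.pickle"))) + 1
    # Zero-padded numbers keep the file names in order when sorted as text.
    path = directory / f"{prefix}-{version:06d}.pickle"
    with open(path, "wb") as f:
        pickle.dump(model, f)
    return path


def load_model(directory, prefix="km", steps_back=0):
    """Load the newest saved model, or an older one when steps_back > 0."""
    paths = sorted(Path(directory).glob(f"{prefix}-*.pickle"))
    with open(paths[-1 - steps_back], "rb") as f:
        return pickle.load(f)


# Usage with stand-in objects; a fitted KModes instance would be passed the same way.
with tempfile.TemporaryDirectory() as model_dir:
    save_model({"run": 1}, model_dir)
    save_model({"run": 2}, model_dir)
    latest = load_model(model_dir)
    previous = load_model(model_dir, steps_back=1)
```

Here load_model(model_dir) returns the most recent model and steps_back=1 returns the one saved before it. A pickled model should be loaded with the same library versions that wrote it, so recording those versions alongside each file is worth considering.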

Upvotes: 8

svohara

Reputation: 2189

It appears that the KModes and KPrototypes classes inherit from scikit-learn’s BaseEstimator. In sklearn, you can save/load a trained model via standard serialization, using pickle.

Here’s a link to sklearn docs on saving a model using pickle or the serialization code from joblib: http://scikit-learn.org/stable/modules/model_persistence.html

Does this answer address your problem? Are the kmodes models not serializable in your application?