qAp

Reputation: 1189

How to save a custom transformer in sklearn?

I am not able to load an instance of a custom transformer saved using either sklearn.externals.joblib.dump or pickle.dump because the original definition of the custom transformer is missing from the current python session.

Suppose that in one Python session I define, create, and save a custom transformer; it can also be loaded in the same session:

from sklearn.base import TransformerMixin
from sklearn.base import BaseEstimator
from sklearn.externals import joblib

class CustomTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        return X


custom_transformer = CustomTransformer()    
joblib.dump(custom_transformer, 'custom_transformer.pkl')

loaded_custom_transformer = joblib.load('custom_transformer.pkl')

Opening a new Python session and loading from 'custom_transformer.pkl'

from sklearn.externals import joblib

joblib.load('custom_transformer.pkl')

raises the following exception:

AttributeError: module '__main__' has no attribute 'CustomTransformer'

The same thing is observed if joblib is replaced with pickle. Saving the custom transformer in one session with

with open('custom_transformer_pickle.pkl', 'wb') as f:
    pickle.dump(custom_transformer, f, -1)

and loading it in another:

with open('custom_transformer_pickle.pkl', 'rb') as f:
    loaded_custom_transformer_pickle = pickle.load(f)

raises the same exception.

In the above, if CustomTransformer is replaced with, say, sklearn.preprocessing.StandardScaler, then the saved instance loads fine in a new Python session.

Is it possible to be able to save a custom transformer and load it later somewhere else?

Upvotes: 18

Views: 11506

Answers (3)

Schopen Hacker

Reputation: 322

I didn't use sklearn.externals.joblib, just the standalone joblib module, and it works:

example:

import pandas as pd
from nltk.corpus import stopwords
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# NGRAM_RANGE, TOKEN_MODE and MIN_DOCUMENT_FREQUENCY are module-level
# constants assumed to be defined elsewhere, e.g. NGRAM_RANGE = (1, 2).

class CustomNgramVectorize(BaseEstimator, TransformerMixin):
    """Vectorizes texts as n-gram vectors"""
    def __init__(self, text, reduce=True):
        # Create keyword arguments to pass to the tf-idf vectorizer.
        kwargs = {
                'ngram_range': NGRAM_RANGE,  # use 1-grams + 2-grams
                'dtype': 'int32',
                'strip_accents': 'unicode',
                'decode_error': 'replace',
                'max_features': 1000,  # limit number of words
                'sublinear_tf': True,  # apply sublinear tf scaling
                'stop_words': stopwords.words('french'),  # drop French stopwords
                'analyzer': TOKEN_MODE,  # split text into word tokens
                'min_df': MIN_DOCUMENT_FREQUENCY,
        }
        self.tfidf_vectorizer = TfidfVectorizer(**kwargs)
        self.reduce = reduce
        if self.reduce:
            self.svd = TruncatedSVD(n_components=25, n_iter=25, random_state=12)

    def fit(self, X, y=None):
        tfidf = self.tfidf_vectorizer.fit_transform(X)
        if self.reduce:
            # Fit the SVD here so transform() does not refit it on new data.
            self.svd.fit(tfidf)
        return self  # fit must return self to be pipeline-compatible

    def transform(self, X, y=None):
        X = self.tfidf_vectorizer.transform(X)
        # Convert to a dataframe with one column per vocabulary term.
        X_df = pd.DataFrame(X.toarray(),
                            columns=sorted(self.tfidf_vectorizer.vocabulary_))
        if self.reduce:
            X_df = self.svd.transform(X_df)
        return X_df

Then save it using the joblib.dump function:

# persist model
import joblib
joblib.dump(vectorizer, 'custom_tfidf_vectorizer.joblib')

Later, retrieve it with the joblib.load function:

var='route_of_administration'
v = joblib.load('custom_tfidf_vectorizer.joblib')
v.fit(train[var])
X_df = v.transform(train[var])


Upvotes: 0

Nine

Reputation: 115

It works for me if I pass my transform function to sklearn.preprocessing.FunctionTransformer(), save the model using dill.dump(), and load the ".pkl" file back with dill.load().

Note: I have included the transform function in a sklearn pipeline with my classifier.
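For illustration, a minimal sketch of this approach (add_one is a hypothetical transform function). Because it is defined at module level, plain pickle can serialize the FunctionTransformer by reference; dill is what makes the same round-trip work for lambdas or functions defined interactively:

```python
import pickle

import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Hypothetical transform function. Being module-level, pickle stores it
# by reference; for a lambda or an interactively defined function, swap
# pickle for dill.dump / dill.load, as in the answer.
def add_one(X):
    return X + 1

ft = FunctionTransformer(add_one)
payload = pickle.dumps(ft)

restored = pickle.loads(payload)
result = restored.transform(np.array([1, 2, 3]))
print(result)
```

FunctionTransformer keeps the function as an attribute of the estimator, so serializing the estimator only succeeds if the function itself is serializable, which is exactly where dill helps.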

Upvotes: 2

Matthew Plourde

Reputation: 44614

sklearn.preprocessing.StandardScaler works because the class definition is available in the sklearn package installation, which joblib will look up when you load the pickle.

You'll have to make your CustomTransformer class available in the new session, either by re-defining or importing it.
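A minimal sketch of the import route, simulating a fresh session in a single script (my_transformers.py is a hypothetical module name, and the class is a bare stand-in for the BaseEstimator/TransformerMixin subclass from the question):

```python
import os
import pathlib
import pickle
import sys

# Make sure the current directory is importable.
sys.path.insert(0, os.getcwd())

# Write the class definition to its own module so that, in ANY session,
# "import my_transformers" makes it available to the unpickler.
pathlib.Path('my_transformers.py').write_text(
    'class CustomTransformer:\n'
    '    def fit(self, X, y=None):\n'
    '        return self\n'
    '    def transform(self, X, y=None):\n'
    '        return X\n'
)

import my_transformers

# The pickle records only a reference: "my_transformers.CustomTransformer".
payload = pickle.dumps(my_transformers.CustomTransformer())

# Simulate a fresh session by dropping the module, then unpickle:
# pickle re-imports my_transformers by name, so loading succeeds.
del sys.modules['my_transformers']
obj = pickle.loads(payload)
print(type(obj).__name__)
```

This is why StandardScaler works out of the box: its definition lives in the installed sklearn package, which is importable from every session, whereas a class defined in `__main__` exists only in the session that defined it.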

Upvotes: 8
