Reputation: 1189
I am not able to load an instance of a custom transformer saved with either sklearn.externals.joblib.dump or pickle.dump, because the original definition of the custom transformer is missing from the current Python session.
Suppose that in one Python session I define, create, and save a custom transformer; it can also be loaded in the same session:
from sklearn.base import TransformerMixin
from sklearn.base import BaseEstimator
from sklearn.externals import joblib

class CustomTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        return X

custom_transformer = CustomTransformer()
joblib.dump(custom_transformer, 'custom_transformer.pkl')
loaded_custom_transformer = joblib.load('custom_transformer.pkl')
Opening a new Python session and loading from 'custom_transformer.pkl' with
from sklearn.externals import joblib
joblib.load('custom_transformer.pkl')
raises the following exception:
AttributeError: module '__main__' has no attribute 'CustomTransformer'
The same thing happens if joblib is replaced with pickle. Saving the custom transformer in one session with
with open('custom_transformer_pickle.pkl', 'wb') as f:
    pickle.dump(custom_transformer, f, -1)
and loading it in another:
with open('custom_transformer_pickle.pkl', 'rb') as f:
    loaded_custom_transformer_pickle = pickle.load(f)
raises the same exception.
In the above, if CustomTransformer is replaced with, say, sklearn.preprocessing.StandardScaler, then the saved instance can be loaded in a new Python session.
Is it possible to save a custom transformer and load it later somewhere else?
Upvotes: 18
Views: 11506
Reputation: 322
I didn't use sklearn.externals.joblib, just the standalone joblib module, and it works. Example:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from nltk.corpus import stopwords
import pandas as pd

NGRAM_RANGE = (1, 2)        # use 1-grams + 2-grams
TOKEN_MODE = 'word'         # split text into word tokens
MIN_DOCUMENT_FREQUENCY = 2  # assumed value; not given in the original answer

class CustomNgramVectorize(BaseEstimator, TransformerMixin):
    """Vectorizes texts as n-gram vectors"""

    def __init__(self, text, reduce=True):
        # Create keyword arguments to pass to the 'tf-idf' vectorizer.
        kwargs = {
            'ngram_range': NGRAM_RANGE,
            'dtype': 'int32',
            'strip_accents': 'unicode',
            'decode_error': 'replace',
            'max_features': 1000,                     # limit number of words
            'sublinear_tf': True,                     # apply sublinear tf scaling
            'stop_words': stopwords.words('french'),  # drop French stopwords
            'analyzer': TOKEN_MODE,
            'min_df': MIN_DOCUMENT_FREQUENCY,
        }
        self.tfidf_vectorizer = TfidfVectorizer(**kwargs)
        self.reduce = reduce
        if self.reduce:
            self.svd = TruncatedSVD(n_components=25, n_iter=25, random_state=12)

    def fit(self, X, y=None):
        self.tfidf_vectorizer.fit(X)
        return self

    def transform(self, X, y=None):
        X = self.tfidf_vectorizer.transform(X)
        # convert to dataframe
        X_df = pd.DataFrame(X.toarray(), columns=sorted(self.tfidf_vectorizer.vocabulary_))
        if self.reduce:
            X_df = self.svd.fit_transform(X_df)
        return X_df
Then save it using the joblib.dump function:
# persist model
import joblib
joblib.dump(vectorizer, 'custom_tfidf_vectorizer.joblib')
Later, retrieve it with the joblib.load function:
var = 'route_of_administration'
v = joblib.load('custom_tfidf_vectorizer.joblib')
v.fit(train[var])
X_df = v.transform(train[var])
Upvotes: 0
Reputation: 115
It works for me if I wrap my transform function in sklearn.preprocessing.FunctionTransformer(), save the model using dill.dump(), and load it back with dill.load() from a ".pk" file.
Note: I included the transform function in a sklearn pipeline together with my classifier.
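For context on why dill helps here: unlike pickle, dill can serialise a function by value (its bytecode), so the loading session does not need the original definition to be importable. A minimal sketch with a plain function standing in for the one passed to FunctionTransformer (dill must be installed; the function name is just an illustration):

```python
import dill

def double(X):
    # stand-in for the transform function wrapped by FunctionTransformer
    return [x * 2 for x in X]

# recurse=True asks dill to serialise what the function body references,
# so the payload carries the function itself, not just a name lookup
blob = dill.dumps(double, recurse=True)

# in a fresh session, dill.loads(blob) reconstructs the function from
# its serialised bytecode -- no import of the original module required
restored = dill.loads(blob)
assert restored([1, 2, 3]) == [2, 4, 6]
```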
Upvotes: 2
Reputation: 44614
sklearn.preprocessing.StandardScaler works because the class definition is available in the sklearn package installation, which joblib will look up when you load the pickle.
You'll have to make your CustomTransformer class available in the new session, either by re-defining or importing it.
Upvotes: 8