Reputation: 70
I am writing a custom transformer in scikit-learn that uses stock KMeans to add cluster labels as a new column to a pandas dataframe. The transformer should fit to existing data, then transform unseen data by adding a new column named 'Cluster', returning a new dataframe with the additional column without modifying the original dataframe. Below is the code that I came up with:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans

class AddClustersFeature(BaseEstimator, TransformerMixin):
    def __init__(self, clusters = 10):
        self.clusters = clusters
        self.model = KMeans(n_clusters = self.clusters)

    def fit(self, X):
        self.X=X
        self.model.fit (self.X)
        return self.model

    def transform(self, X):
        self.X=X
        X_=X.copy() # avoiding modification of the original df
        X_['Clusters'] = self.model.transform(self.X_).labels_
        return X_

cluster_enc_tr_data = AddClustersFeature().fit_transform(enc_tr_data)
cluster_enc_tr_data
Unfortunately, the code does not work properly. The result is a dataframe with cluster numbers as column indices, with row numbers and values I have not seen before. Any help or tips will be greatly appreciated.
Update 23 of June 21 v2: Please see below the code after implementing Ben's revised comments. It works perfectly now.
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans

class AddClustersFeature(BaseEstimator, TransformerMixin):
    def __init__(self, clusters = 10):
        self.clusters = clusters

    def fit(self, X):
        self.X=X
        self.model = KMeans(n_clusters = self.clusters)
        self.model.fit (self.X)
        return self

    def transform(self, X):
        self.X=X
        X_=X.copy() # avoiding modification of the original df
        X_['Clusters'] = self.model.predict(X_)
        return X_

cluster_enc_tr_data = AddClustersFeature().fit_transform(enc_tr_data)
Upvotes: 0
Views: 872
Reputation: 12592
The fit method must always return self.
The problem here is that fit_transform(X, y), inherited from TransformerMixin, is just fit(X, y).transform(X); your fit currently returns the underlying KMeans transformer, and that is used to transform X instead of your transform.
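To make the difference visible, here is a minimal sketch. The class names BrokenFit and FixedFit, the toy dataframe, and the fixed n_init/random_state values are made up for illustration and are not part of the question's code. It assumes the default scikit-learn output configuration.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans


class BrokenFit(BaseEstimator, TransformerMixin):
    """Hypothetical example: fit returns the inner KMeans (the bug)."""

    def __init__(self, clusters=3):
        self.clusters = clusters

    def fit(self, X, y=None):
        # n_init and random_state are fixed here only to keep the demo repeatable
        self.model_ = KMeans(n_clusters=self.clusters, n_init=10, random_state=0)
        self.model_.fit(X)
        return self.model_  # wrong: fit_transform will now call KMeans.transform

    def transform(self, X):
        X_ = X.copy()  # leave the caller's dataframe untouched
        X_["Clusters"] = self.model_.predict(X)
        return X_


class FixedFit(BrokenFit):
    """Same transformer, but fit returns self as scikit-learn expects."""

    def fit(self, X, y=None):
        super().fit(X, y)
        return self


# Toy data: three well-separated pairs of points
X = pd.DataFrame({"a": [0.0, 0.1, 5.0, 5.1, 10.0, 10.1],
                  "b": [0.0, 0.2, 5.0, 5.2, 10.0, 10.2]})

broken = BrokenFit().fit_transform(X)  # output of KMeans.transform: a distance matrix
fixed = FixedFit().fit_transform(X)    # output of our transform: dataframe plus labels

print(type(broken).__name__, broken.shape)
print(list(fixed.columns))
```

With the broken version, fit_transform hands back the cluster-distance array rather than the dataframe, which matches the unexpected output described in the question.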
A few more notes though:
KMeans.transform gives the cluster-distance matrix, but you want the cluster labels. Use predict instead, and drop labels_, so just X_['Clusters'] = self.model.predict(X_).
__init__ should only set attributes that appear in its signature, in order for cloning to work (required for e.g. hyperparameter searches). You can define self.model at fit time.
In transform, you use self.X_ but it is never defined; I guess you mean just X_. There is no real reason to save X at fit time either; self.X is never really needed.
This will only work on dataframes; that may not be a problem for you, but keep it in mind. (You can't use this as a step in a pipeline after builtin sklearn transformers, because those will return numpy arrays.)
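Putting these notes together, one possible version looks like the sketch below. The attribute name model_, the fixed n_init/random_state values, the array branch in transform, and the random demo data are my own assumptions for illustration; they do not come from the question.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


class AddClustersFeature(BaseEstimator, TransformerMixin):
    def __init__(self, clusters=10):
        # Only store constructor arguments here so that clone() works
        self.clusters = clusters

    def fit(self, X, y=None):
        # The KMeans model is created at fit time, not in __init__
        # (n_init and random_state are fixed only to keep the demo repeatable)
        self.model_ = KMeans(n_clusters=self.clusters, n_init=10, random_state=0)
        self.model_.fit(X)
        return self

    def transform(self, X):
        labels = self.model_.predict(X)
        if isinstance(X, pd.DataFrame):
            X_ = X.copy()  # avoid modifying the original dataframe
            X_["Clusters"] = labels
            return X_
        # Assumed fallback for array input: append the labels as a last column
        return np.column_stack([np.asarray(X), labels])


# Random demo data, purely illustrative
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(50, 3)), columns=["a", "b", "c"])

# After StandardScaler the input is a numpy array, so the array branch is used
pipe = make_pipeline(StandardScaler(), AddClustersFeature(clusters=4))
out = pipe.fit_transform(df)

# Cloning only copies the constructor parameters
cloned = clone(AddClustersFeature(clusters=4))
print(out.shape, cloned.get_params())
```

Appending integer labels to a float array casts them to float, so if you need the labels as integers downstream, the dataframe path is the more convenient one.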
Upvotes: 1