Wlodek K

Reputation: 70

Creating a custom transformer in scikit-learn that adds cluster labels

I am writing a custom transformer in scikit-learn that uses stock KMeans to add cluster labels as a new column to a pandas dataframe. The custom transformer should fit to existing data, then transform unseen data by adding a new column named 'Cluster', and return a new dataframe with the additional column without modifying the original dataframe. Below is the code that I came up with:

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans

class AddClustersFeature(BaseEstimator, TransformerMixin):
    def __init__(self, clusters = 10): 
        self.clusters = clusters
        self.model = KMeans(n_clusters = self.clusters)
           
    def fit(self, X):
        self.X=X
        self.model.fit (self.X)
        return self.model
       
    def transform(self, X):
        self.X=X
        X_=X.copy() # avoiding modification of the original df
        
        X_['Clusters'] = self.model.transform(self.X_).labels_
        
        return X_

cluster_enc_tr_data = AddClustersFeature().fit_transform(enc_tr_data)
cluster_enc_tr_data

Unfortunately the code does not work properly. The result is a dataframe with cluster numbers as column indices, with row numbers and previously unknown values. Any help or tips will be greatly appreciated.

Update 23 of June 21 v2: Please see below the code after implementing Ben's revised comments. It works perfectly now.

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans

class AddClustersFeature(BaseEstimator, TransformerMixin):
    def __init__(self, clusters=10):
        self.clusters = clusters

    def fit(self, X):
        self.X = X
        self.model = KMeans(n_clusters=self.clusters)
        self.model.fit(self.X)
        return self

    def transform(self, X):
        self.X = X
        X_ = X.copy()  # avoid modifying the original df
        X_['Clusters'] = self.model.predict(X_)
        return X_

cluster_enc_tr_data = AddClustersFeature().fit_transform(enc_tr_data)

Upvotes: 0

Views: 872

Answers (1)

Ben Reiniger

Reputation: 12592

The fit method must always return self.

The problem here is that fit_transform(X, y), inherited from TransformerMixin, is just fit(X, y).transform(X); your fit now returns the underlying KMeans transformer, and that is used to transform X instead of your transform.
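
To make this concrete, here is a minimal sketch. The class names ReturnsModel and ReturnsSelf and the random toy data are made up for illustration, and n_init and random_state are pinned only so the run is repeatable. It compares a fit that returns the inner KMeans with one that returns self:

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans


class ReturnsModel(BaseEstimator, TransformerMixin):
    """Hypothetical transformer whose fit wrongly returns the inner KMeans."""

    def __init__(self, clusters=4):
        self.clusters = clusters

    def fit(self, X, y=None):
        self.model_ = KMeans(n_clusters=self.clusters, n_init=10, random_state=0)
        self.model_.fit(X)
        return self.model_  # bug: fit_transform will call KMeans.transform next

    def transform(self, X):
        X_ = X.copy()
        X_["Clusters"] = self.model_.predict(X)
        return X_


class ReturnsSelf(ReturnsModel):
    """Same transformer, but fit returns self as scikit-learn expects."""

    def fit(self, X, y=None):
        super().fit(X, y)
        return self


# Toy data (an assumption, not the asker's enc_tr_data).
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(30, 2)), columns=["a", "b"])

wrong = ReturnsModel().fit_transform(X)  # output of KMeans.transform: distances
right = ReturnsSelf().fit_transform(X)   # output of our own transform

print(wrong.shape)           # one distance column per cluster
print(list(right.columns))   # original columns plus the label column
```

With the first class, fit_transform ends up calling KMeans.transform, so the result has one column of distances per cluster rather than the original dataframe plus a label column.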

A few more notes though:

  1. KMeans.transform gives the cluster-distance matrix, but you want the cluster labels. Use predict instead, and drop labels_, so just X_['Clusters'] = self.model.predict(X_).

  2. __init__ should only set attributes that appear in its signature, in order for cloning to work (required for e.g. hyperparameter searches). You can define self.model at fit time.

  3. In transform, you use self.X_ but it is never defined; I guess you mean just X_. There is no real reason to save X at fit time either; self.X is never really needed.

  4. This will only work on dataframes; that may not be a problem for you, but keep it in mind. (You can't use this as a step in a pipeline after builtin sklearn transformers, because those will return numpy arrays.)
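
Putting the notes above together, one possible version could look like the sketch below. Treat it as an illustration under assumptions rather than the definitive fix: the attribute name model_ (trailing underscore, following the scikit-learn convention for fitted attributes), the fixed n_init and random_state, the numpy fallback via np.column_stack, and the toy data are all my own choices, not something from the question.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


class AddClustersFeature(BaseEstimator, TransformerMixin):
    def __init__(self, clusters=10):
        # Note 2: only store what appears in the signature.
        self.clusters = clusters

    def fit(self, X, y=None):
        # Build the KMeans at fit time so it always sees the current
        # value of self.clusters (e.g. after set_params in a search).
        # n_init and random_state are pinned here only for repeatability.
        self.model_ = KMeans(n_clusters=self.clusters, n_init=10, random_state=0)
        self.model_.fit(X)
        return self  # fit must return self

    def transform(self, X):
        # Note 1: predict gives labels; transform would give distances.
        labels = self.model_.predict(X)
        if isinstance(X, pd.DataFrame):
            X_ = X.copy()  # leave the caller's dataframe untouched
            X_["Clusters"] = labels
            return X_
        # Note 4: fall back to appending a column for array input.
        return np.column_stack([np.asarray(X), labels])


# Toy data (an assumption, standing in for enc_tr_data).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(30, 2)), columns=["a", "b"])

# Dataframe in, dataframe out.
df_out = AddClustersFeature(clusters=3).fit_transform(df)

# After a builtin transformer the input is a numpy array.
pipe = make_pipeline(StandardScaler(), AddClustersFeature(clusters=3))
arr_out = pipe.fit_transform(df)

# set_params is what a hyperparameter search uses; the new value is
# picked up because the model is created inside fit.
tuned = AddClustersFeature().set_params(clusters=4).fit(df)
print(tuned.model_.n_clusters)
```

Had the KMeans been created in __init__, set_params(clusters=4) would change self.clusters but leave the already-built model with its old n_clusters, which is one way the second note shows up in practice.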

Upvotes: 1
