Hoang Cuong Nguyen

Reputation: 431

How to build a custom scaler based on StandardScaler?

I am trying to build a custom scaler that scales only the continuous variables of a dataset (the US Adult Income: https://www.kaggle.com/uciml/adult-census-income), using StandardScaler as a base. Here is the Python code I used:

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler

class CustomScaler(BaseEstimator,TransformerMixin): 
    def __init__(self,columns,copy=True,with_mean=True,with_std=True):
        self.scaler = StandardScaler(copy,with_mean,with_std)
        self.columns = columns
        self.mean_ = None
        self.var_ = None
    def fit(self, X, y=None):
        self.scaler.fit(X[self.columns], y)
        self.mean_ = np.mean(X[self.columns])
        self.var_ = np.var(X[self.columns])
        return self

    def transform(self, X, y=None, copy=None):
        init_col_order = X.columns
        X_scaled = pd.DataFrame(self.scaler.transform(X[self.columns]), columns=self.columns)
        X_not_scaled = X.loc[:,~X.columns.isin(self.columns)]
        return pd.concat([X_not_scaled, X_scaled], axis=1)[init_col_order]


continuous = df.iloc[:, np.r_[0,2,10:13]] 
#basically independent variables that I consider continuous

columns_to_scale = continuous

scaler = CustomScaler(columns_to_scale)


However, when I tried to run the scaler, I got an error (shown in a screenshot in the original post).

What is wrong with the way I built the scaler? And how would you build a custom scaler for this dataset?

Thank you!

Upvotes: 1

Views: 1880

Answers (2)

Ben Reiniger

Reputation: 12748

I agree with @AntoineDubuis that ColumnTransformer is a better (built-in!) way to do this. That said, I'd like to address where your code goes wrong.

In fit, you have self.scaler.fit(X[self.columns], y); this indicates that self.columns should be a list of column names (or one of a few other valid indexers). But you pass continuous = df.iloc[:, np.r_[0,2,10:13]] as columns, and that is a DataFrame, not a list of names.
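As a minimal sketch of the fix on the calling side, the column labels can be taken from the positional selection and passed instead of the frame itself. The 15-column frame below is a hypothetical stand-in with made-up names c0 to c14, not the real Adult data:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in frame with 15 columns named c0..c14 (not the real dataset).
df = pd.DataFrame(np.arange(30).reshape(2, 15),
                  columns=[f"c{i}" for i in range(15)])

# Positional selection returns a DataFrame ...
continuous = df.iloc[:, np.r_[0, 2, 10:13]]

# ... so hand over its column labels, which X[self.columns] can index with.
columns_to_scale = continuous.columns.tolist()
print(columns_to_scale)
```

The resulting list is what CustomScaler(columns_to_scale) expects.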

A couple other issues:

  1. In __init__, you should only set attributes that come from its signature, storing them unchanged, or cloning will fail. Create self.scaler in fit instead, and save its parameters (copy, with_mean, with_std) directly as attributes in __init__. Don't initialize mean_ or var_ there.
  2. You never actually use mean_ or var_. You can keep them if you want, but the relevant statistics are already stored in the scaler object.
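Putting these points together, one possible rewrite of the class might look like the sketch below. It is an illustration under a few assumptions rather than the only way to do it: the inner scaler is stored as scaler_ (a name chosen here), its arguments are passed by keyword because recent scikit-learn versions no longer accept them positionally, and the toy frame with columns A, B, C is made up for the demonstration.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.preprocessing import StandardScaler


class CustomScaler(BaseEstimator, TransformerMixin):
    def __init__(self, columns, copy=True, with_mean=True, with_std=True):
        # Only store the constructor arguments, unchanged, so clone() works.
        self.columns = columns
        self.copy = copy
        self.with_mean = with_mean
        self.with_std = with_std

    def fit(self, X, y=None):
        # Build the inner scaler at fit time, passing its options by keyword.
        self.scaler_ = StandardScaler(
            copy=self.copy, with_mean=self.with_mean, with_std=self.with_std
        )
        self.scaler_.fit(X[self.columns])
        return self

    def transform(self, X):
        # Keep the original index so the scaled values line up row by row.
        scaled = pd.DataFrame(
            self.scaler_.transform(X[self.columns]),
            columns=self.columns,
            index=X.index,
        )
        X_out = X.copy()
        X_out[self.columns] = scaled  # column order of X is preserved
        return X_out


# Toy data (hypothetical): scale A and B, leave C untouched.
df = pd.DataFrame({"A": np.arange(10), "B": np.arange(10), "C": np.arange(10)})
scaler = CustomScaler(columns=["A", "B"])
scaled = scaler.fit_transform(df)
cloned = clone(scaler)  # succeeds because __init__ only stores its arguments
print(scaled.head())
```

The fitted statistics remain available as scaler.scaler_.mean_ and scaler.scaler_.var_, so separate mean_ and var_ attributes are not needed.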

Upvotes: 1

Antoine Dubuis

Reputation: 5324

There is no need to create a custom transformer for this problem, as the operation can be performed with ColumnTransformer. This transformer allows different columns of the input to be transformed separately.

The example below scales the columns ['A', 'B'] and leaves column C unchanged.

import numpy as np
import pandas as pd

from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({'A': np.arange(10), 
                 'B': np.arange(10),
                 'C':  np.arange(10)})

transformer = make_column_transformer(
    (StandardScaler(), ['A', 'B']),
    remainder='passthrough',  # keep column C as-is instead of dropping it
)

pd.DataFrame(transformer.fit_transform(df), columns=df.columns)

This outputs the following result:

          A         B    C
0 -1.566699 -1.566699  0.0
1 -1.218544 -1.218544  1.0
2 -0.870388 -0.870388  2.0
3 -0.522233 -0.522233  3.0
4 -0.174078 -0.174078  4.0
5  0.174078  0.174078  5.0
6  0.522233  0.522233  6.0
7  0.870388  0.870388  7.0
8  1.218544  1.218544  8.0
9  1.566699  1.566699  9.0

Upvotes: 3
