Reputation: 431
I am trying to build a custom scaler that scales only the continuous variables of a dataset (the US Adult Income dataset: https://www.kaggle.com/uciml/adult-census-income), using StandardScaler as a base. Here is the Python code I used:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler

class CustomScaler(BaseEstimator,TransformerMixin):
    def __init__(self,columns,copy=True,with_mean=True,with_std=True):
        self.scaler = StandardScaler(copy,with_mean,with_std)
        self.columns = columns
        self.mean_ = None
        self.var_ = None

    def fit(self, X, y=None):
        self.scaler.fit(X[self.columns], y)
        self.mean_ = np.mean(X[self.columns])
        self.var_ = np.var(X[self.columns])
        return self

    def transform(self, X, y=None, copy=None):
        init_col_order = X.columns
        X_scaled = pd.DataFrame(self.scaler.transform(X[self.columns]), columns=self.columns)
        X_not_scaled = X.loc[:,~X.columns.isin(self.columns)]
        return pd.concat([X_not_scaled, X_scaled], axis=1)[init_col_order]

X=new_df_upsampled.copy()
X.drop('income',axis=1,inplace=True)
continuous = df.iloc[:, np.r_[0,2,10:13]]
#basically independent variables that I consider continuous
columns_to_scale = continuous
scaler = CustomScaler(columns_to_scale)
scaler.fit(X)
However, when I tried to run the scaler, I ran into this problem:
So what is wrong with the way I built the scaler? And furthermore, how would you build a custom scaler for this dataset?
Thank you!
Upvotes: 1
Views: 1880
Reputation: 12748
I agree with @AntoineDubuis that ColumnTransformer is a better (builtin!) way to do this. That said, I'd like to address where your code goes wrong.
In fit, you have self.scaler.fit(X[self.columns], y); this indicates that self.columns should be a list of column names (or a few other options). But you've defined the parameter as continuous = df.iloc[:, np.r_[0,2,10:13]], which is a dataframe.
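As a minimal sketch of that fix, the same positional selection can be applied to the column index instead of the dataframe, which yields a list of names. The toy frame and its column names c0..c13 are made up for illustration; only the index expression comes from the question.

import numpy as np
import pandas as pd

# Hypothetical stand-in for the question's df: 14 made-up columns c0..c13.
df = pd.DataFrame(np.zeros((3, 14)), columns=[f"c{i}" for i in range(14)])

# Same positions as in the question, but taken from df.columns, so the
# result is a plain list of column names rather than a dataframe.
columns_to_scale = df.columns[np.r_[0, 2, 10:13]].tolist()
print(columns_to_scale)  # ['c0', 'c2', 'c10', 'c11', 'c12']

A list like this can then be passed as the columns argument, so that X[self.columns] selects columns by name.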
A couple other issues:
You should only set attributes in __init__ that come from its signature, or cloning will fail. Move self.scaler to fit, and save its parameters copy etc. directly at __init__. Don't initialize mean_ or var_.
You don't really use mean_ or var_. You can keep them if you want, but the relevant statistics are stored in the scaler object.
Upvotes: 1
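For illustration, here is one way the class might look with those points applied. This is a sketch rather than a definitive implementation: the attribute name scaler_ and the toy data are my own choices, and the scaler's options are passed by keyword because recent scikit-learn versions only accept them that way. Writing the scaled values back into a copy of X (instead of concatenating a new dataframe) is an additional change that keeps the original index and column order.

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.preprocessing import StandardScaler


class CustomScaler(BaseEstimator, TransformerMixin):
    def __init__(self, columns, copy=True, with_mean=True, with_std=True):
        # Only store the constructor arguments, unchanged, so clone() works.
        self.columns = columns
        self.copy = copy
        self.with_mean = with_mean
        self.with_std = with_std

    def fit(self, X, y=None):
        # Build and fit the inner scaler here; the trailing underscore marks
        # it as an attribute that exists only after fitting.
        self.scaler_ = StandardScaler(
            copy=self.copy, with_mean=self.with_mean, with_std=self.with_std
        )
        self.scaler_.fit(X[self.columns])
        return self

    def transform(self, X):
        X_out = X.copy()
        # Assigning the NumPy result into the copy keeps X's index and
        # column order, so no concat or index alignment is involved.
        X_out[self.columns] = self.scaler_.transform(X[self.columns])
        return X_out


# Toy data with a non-default index (made up for this example).
X = pd.DataFrame({"age": [20.0, 30.0, 40.0], "sex": [0, 1, 0]}, index=[5, 7, 9])
scaler = CustomScaler(columns=["age"])
scaler_copy = clone(scaler)  # works because __init__ only stores its arguments
X_scaled = scaler.fit_transform(X)
print(X_scaled)

The fitted statistics remain available as scaler.scaler_.mean_ and scaler.scaler_.var_, so separate mean_ and var_ attributes are not needed.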
Reputation: 5324
There is no need to create a custom transformer for this problem, as this operation can be performed using ColumnTransformer. This transformer allows different columns of the input to be transformed separately.
The example below scales the columns ['A', 'B'] without changing the column C.
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler
df = pd.DataFrame({'A': np.arange(10),
                   'B': np.arange(10),
                   'C': np.arange(10)})
transformer = make_column_transformer(
    (StandardScaler(), ['A', 'B']),
    remainder='passthrough'
)
pd.DataFrame(transformer.fit_transform(df), columns=df.columns)
This outputs the following result:
          A         B    C
0 -1.566699 -1.566699  0.0
1 -1.218544 -1.218544  1.0
2 -0.870388 -0.870388  2.0
3 -0.522233 -0.522233  3.0
4 -0.174078 -0.174078  4.0
5  0.174078  0.174078  5.0
6  0.522233  0.522233  6.0
7  0.870388  0.870388  7.0
8  1.218544  1.218544  8.0
9  1.566699  1.566699  9.0
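One caveat: ColumnTransformer places the transformed columns first and the passthrough columns after them, so relabelling with df.columns works here only because A and B already come before C. The sketch below, which assumes scikit-learn 1.0 or later (for get_feature_names_out and verbose_feature_names_out), scales B and C instead and takes the column names from the transformer itself.

import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({'A': np.arange(10),
                   'B': np.arange(10),
                   'C': np.arange(10)})

# Scale B and C this time, so the output order (B, C, A) differs from df.columns.
transformer = make_column_transformer(
    (StandardScaler(), ['B', 'C']),
    remainder='passthrough',
    verbose_feature_names_out=False,  # plain names, no "standardscaler__" prefix
)
values = transformer.fit_transform(df)

# Label the columns with the names the transformer reports, then put them
# back in the original column order.
out = pd.DataFrame(values, columns=transformer.get_feature_names_out(), index=df.index)
out = out[df.columns]
print(out.head())

For the Adult dataset, the same pattern should apply with the list of continuous column names in place of ['B', 'C'].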
Upvotes: 3