swepab

Reputation: 527

Matrix (X) is reduced in dimensionality (rows) in custom sklearn transformer class

Given the code below, the custom transformer class I'm trying to build - which adds a few columns and runs the result through a grid search - works well on its own, but the matrix loses rows when executed through a pipeline. Can someone explain what goes wrong? I'm clearly missing something here. Search for the comment "What happens here, dimensionality reduced in rows?", where I print the problem. The full code for execution can be found below!

import pandas as pd
import numpy as np

from sklearn.datasets import load_breast_cancer
from sklearn import linear_model
from sklearn.base import BaseEstimator
from sklearn.base import ClassifierMixin
from sklearn.base import clone
from sklearn.base import TransformerMixin
from sklearn.model_selection import GridSearchCV

from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline

dict_breast_c = load_breast_cancer()
X = pd.DataFrame(dict_breast_c.data, columns=dict_breast_c.feature_names)
X.columns = [col.replace(" ", "_").replace("mean", "avg") for col in X.columns]
X_sub = X[[col for col in X.columns if col.find("avg") >= 0]]
X_feat_hldr = X[[col for col in X.columns if col not in (X_sub.columns)]]

y = pd.Series(dict_breast_c.target)

print ("Full X matrix shape: {}".format(X_sub.shape))
print ("Full feature holder shape: {}".format(X_feat_hldr.shape))
print ("Target vector: {}".format(y.shape))

class c_FeatureAdder(BaseEstimator, TransformerMixin):

    def __init__(self, X_feat_hldr, add_error_feat = True, add_worst_feat = True): # no *args or **kwargs
        self.add_error_feat = add_error_feat
        self.add_worst_feat = add_worst_feat
        self.list_col_error = list_col_error
        self.list_col_wrst = list_col_wrst
        self.X_feat_hldr = X_feat_hldr

    def fit(self, X, y=None):
        return self  # nothing else to do

    def transform(self, X, y=None):

        if self.add_error_feat and not self.add_worst_feat:
            print ("Adding error (std) features:")
            return np.c_[X, self.X_feat_hldr[self.list_col_error]]

        elif not self.add_error_feat and self.add_worst_feat:
            print ("Adding worst features:")
            return np.c_[X, self.X_feat_hldr[self.list_col_wrst]]

        elif self.add_error_feat and self.add_worst_feat:
            # What happens here, dimensionality reduced in rows?
            print ("Adding error (std) features and worst features")
            print ("Feature: {}".format(self.list_col_error))
            print ("Feature: {}".format(self.list_col_wrst))
            print ("Something happends to number of rows! {}".format(X.shape))
            print (self.X_feat_hldr.shape)
            print (np.c_[X, self.X_feat_hldr[self.list_col_wrst].values, self.X_feat_hldr[self.list_col_error].values])
            return np.c_[X, self.X_feat_hldr[self.list_col_wrst].values, self.X_feat_hldr[self.list_col_error].values]

        else:
            print ("Adding no new features, passing indata:")
            return X




# Set a classifier, start with base form of logistic regression
clf_log_reg = linear_model.LogisticRegression(random_state=1234)

# Column lists used by the feature-adding step of the pipeline
list_col_error = [col for col in X_feat_hldr[0:2] if col.find("error") >= 0][0:1]
list_col_wrst = [col for col in X_feat_hldr[0:2] if col.find("worst") >= 0][0:2]

print (list_col_error)
print (list_col_wrst)

# Generate a pipeline of wanted transformers on data. End with classifier
pipe_log_reg = Pipeline(
    [('add_feat', c_FeatureAdder(X_feat_hldr))
    ,('clf', clf_log_reg)]
)




# Set the parameter grid for the pipe above: toggle the feature adding in the c_FeatureAdder() class and vary the classifier penalty
param_grid = {
    'add_feat__add_error_feat' : [True, False]
    ,'add_feat__add_worst_feat' : [True, False]
    ,'clf__penalty' : ['l2', 'l1']
    ,'clf__C' : [1]
}



# Initialize GridSearch over the parameter space
gs_lg_reg = GridSearchCV(
    estimator = pipe_log_reg
    ,param_grid = param_grid
    ,scoring = 'accuracy'
    ,n_jobs = 1
)


# Assign names
X_train = X_sub.values
y_train = y.values

print (X_train.shape)

# Fit data
gs_lg_reg_fit = gs_lg_reg.fit(X_train
                              ,y_train)

# Best estimator from GridSearch
gs_optimal_mdl_lg_reg = gs_lg_reg_fit.best_estimator_

Upvotes: 1

Views: 55

Answers (1)

adrin

Reputation: 4896

You have a few mistakes.

You're using global variables in your transformer:

    self.list_col_error = list_col_error
    self.list_col_wrst = list_col_wrst

If your transformer needs an input, it should take it as a parameter to the constructor (__init__). Avoid relying on global variables in your class.
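For example, a minimal sketch of a constructor that takes everything explicitly (keeping your attribute names; the extra constructor parameters are my addition, not part of your original code):

    class c_FeatureAdder(BaseEstimator, TransformerMixin):

        def __init__(self, X_feat_hldr, list_col_error, list_col_wrst,
                     add_error_feat=True, add_worst_feat=True):
            # Everything the transformer needs arrives through the
            # constructor; nothing is read from the script's namespace.
            self.X_feat_hldr = X_feat_hldr
            self.list_col_error = list_col_error
            self.list_col_wrst = list_col_wrst
            self.add_error_feat = add_error_feat
            self.add_worst_feat = add_worst_feat

The pipeline step then becomes c_FeatureAdder(X_feat_hldr, list_col_error, list_col_wrst).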

Your transformer should be able to transform any number of given samples.

The idea of the transform function is to transform any given sample or sample set. In practice, you may persist your transformer and later use it to transform an arbitrary number of newly arrived samples. Use the fit function to capture whatever the transformer needs from the training data, and fit it accordingly. Once you've fit your whole pipeline, you should be able to give it one sample and get the output for that one sample.
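One way to honor that contract here (a sketch of an alternative design, not your original API): feed the pipeline the full feature matrix, error and worst columns included, and let the transformer select columns instead of bolting a fixed-size side table onto each fold. Column selection works for any number of rows:

    from sklearn.base import BaseEstimator, TransformerMixin

    class ColumnFeatureAdder(BaseEstimator, TransformerMixin):
        # Expects X to be a DataFrame that already carries all columns,
        # so no side table with a fixed row count is needed.
        # (Class name and column prefixes are illustrative.)

        def __init__(self, add_error_feat=True, add_worst_feat=True):
            self.add_error_feat = add_error_feat
            self.add_worst_feat = add_worst_feat

        def fit(self, X, y=None):
            return self  # stateless: nothing to learn from the data

        def transform(self, X, y=None):
            cols = [c for c in X.columns if "avg" in c]
            if self.add_error_feat:
                cols += [c for c in X.columns if "error" in c]
            if self.add_worst_feat:
                cols += [c for c in X.columns if "worst" in c]
            return X[cols].values

Fit the grid search on the full X (instead of X_sub) and every CV fold carries its own extra features, so the row counts always line up.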

GridSearchCV does 3-fold cross-validation by default.

As stated in the docs:

cv : int, cross-validation generator or an iterable, optional

Determines the cross-validation splitting strategy. Possible inputs for cv are:

None, to use the default 3-fold cross validation,

Which means it uses 2/3 of your input data to fit the pipeline in each fold. If you check your output, you'll see that your code complains the new data has 379 rows whereas the old data had 569 rows. 379 / 569 = 0.666080844. This is where the changed number of rows comes from.
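You can verify the fold sizes yourself; a quick check, assuming the stratified 3-fold splitting GridSearchCV applies by default for classifiers:

    from sklearn.model_selection import StratifiedKFold

    skf = StratifiedKFold(n_splits=3)
    for train_idx, test_idx in skf.split(X_train, y_train):
        # roughly 379 training rows and 190 test rows per fold
        print(len(train_idx), len(test_idx))

Each training split holds about 379 of the 569 rows, which is exactly the number your transformer is handed.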

Upvotes: 1
