thePurplePython

Reputation: 2767

sklearn ColumnTransformer: include existing features (e.g. boolean) that do not need to be transformed/preprocessed in the pipeline

I already have some boolean features (1 or 0), but I also have some categorical vars that need OHE and some numeric vars that need to be imputed/scaled ... I can add the categorical vars + numeric vars to a pipeline ColumnTransformer, but how do I add the boolean features to the pipeline so they get included in the model? I can't find any examples or a good search phrase for this kind of dilemma ... any ideas?

Here is an example from sklearn combining a num and cat pipeline. However, what if some of my features are already in boolean form (1/0) and do not need to be preprocessed/OHE'd ... how do I keep those features (i.e. add them to the pipeline along with the num and cat variables)?

source: https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html

import pandas as pd

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

titanic_url = ('https://raw.githubusercontent.com/amueller/scipy-2017-sklearn/091d371/notebooks/datasets/titanic3.csv')

data = pd.read_csv(titanic_url)

numeric_features = ['age', 'fare']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_features = ['embarked', 'sex', 'pclass']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression(solver='lbfgs'))])

X = data.drop('survived', axis=1)
y = data['survived']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

clf.fit(X_train, y_train)
print("model score: %.3f" % clf.score(X_test, y_test))

Upvotes: 2

Views: 1313

Answers (2)

BSalita

Reputation: 8931

Use remainder='passthrough' to pass through any columns not handled in your column_transformers list. That said, @thePurplePython's answer can also be quite useful.

preprocessor_pipeline = sklearn.compose.ColumnTransformer(column_transformers, remainder='passthrough')
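Applied to the preprocessor from the question, that might look like the sketch below; note that remainder='passthrough' forwards every column not named in the transformers, so any columns you don't want in the model would need to be dropped from X beforehand.

from sklearn.compose import ColumnTransformer

# numeric_transformer / categorical_transformer as defined in the question;
# every remaining column (e.g. boolean 0/1 flags) is passed through unchanged
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)],
    remainder='passthrough')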

Alternatively, add a triplet to the transformers list that uses the string 'passthrough' in place of a transformer, together with the list of columns to pass through.

('passthrough','passthrough',passthrough_columns)
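Under the question's Titanic example, the passthrough triplet could be used like this (a minimal sketch; the boolean column names are hypothetical):

from sklearn.compose import ColumnTransformer

# hypothetical boolean columns that are already 0/1 and need no preprocessing
bool_features = ['is_alone', 'has_cabin']

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features),
        ('bool', 'passthrough', bool_features)])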

Upvotes: 1

thePurplePython

Reputation: 2767

I figured out my own question here ... with ColumnTransformer I can just add more features to a list (i.e. numeric_features and categorical_features, as in the code from my question), and with FeatureUnion I can use this DataFrame selector class to add features to a pipeline ... details can be found in this notebook => https://github.com/ageron/handson-ml/blob/master/02_end_to_end_machine_learning_project.ipynb

from sklearn.base import BaseEstimator, TransformerMixin

# Create a class to select numerical or categorical columns
# from a pandas DataFrame and return them as a numpy array
class PandasDataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values

# ex: one branch of a FeatureUnion
feature_list = [...]
("num_features", Pipeline([
    ("select_num_features", PandasDataFrameSelector(feature_list)),
    ("scaler", StandardScaler())]))

Upvotes: 2
