Reputation: 2767
I already have some boolean features (1 or 0), but I also have some categorical vars that need OHE and some numeric vars that need to be imputed/scaled ... I can add the categorical and numeric vars to a pipeline via a ColumnTransformer, but how do I add the boolean features to the pipeline so they get included in the model? I can't find any examples or even a good phrase to search for this kind of dilemma ... any ideas?
Here is an example from sklearn combining a num and cat pipeline. However, what if some of my features are already in boolean form (1/0) and do not need preprocessing/OHE? How do I keep those features (i.e. add them to the pipeline alongside the num and cat variables)?
source: https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

titanic_url = ('https://raw.githubusercontent.com/amueller/scipy-2017-sklearn/091d371/notebooks/datasets/titanic3.csv')
data = pd.read_csv(titanic_url)

numeric_features = ['age', 'fare']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_features = ['embarked', 'sex', 'pclass']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression(solver='lbfgs'))])

X = data.drop('survived', axis=1)
y = data['survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

clf.fit(X_train, y_train)
print("model score: %.3f" % clf.score(X_test, y_test))
Upvotes: 2
Views: 1313
Reputation: 8931
Use remainder='passthrough' to pass through any columns not dealt with in your column_transformers list. That said, @thePurplePython's answer can also be quite useful.
preprocessor_pipeline = sklearn.compose.ColumnTransformer(column_transformers, remainder='passthrough')
Alternatively, add a triplet to the transformers list that uses the string 'passthrough' in place of a transformer object, along with the list of columns to pass through:
('passthrough', 'passthrough', passthrough_columns)
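For instance, a minimal sketch of the remainder='passthrough' approach, assuming a made-up DataFrame where 'amount' needs scaling and two flags are already 0/1:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 'amount' needs scaling; the boolean flags
# are already 0/1 and only need to be passed through.
df = pd.DataFrame({
    'amount': [10.0, 20.0, 30.0],
    'is_member': [1, 0, 1],
    'is_active': [0, 1, 1],
})

# Any column not listed in a transformer is kept as-is
# thanks to remainder='passthrough'.
preprocessor = ColumnTransformer(
    transformers=[('num', StandardScaler(), ['amount'])],
    remainder='passthrough')

out = preprocessor.fit_transform(df)
print(out.shape)  # (3, 3): scaled 'amount' plus the two untouched flags
```

The transformed columns come first in the output, followed by the passthrough columns in their original order.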
Upvotes: 1
Reputation: 2767
I figured my own question out ... with ColumnTransformer I can just add more feature lists (i.e. numeric_features and categorical_features like in the code from my question), and with FeatureUnion I can use this DataFrame selector class to add features to a pipeline. Details can be found in this notebook => https://github.com/ageron/handson-ml/blob/master/02_end_to_end_machine_learning_project.ipynb
from sklearn.base import BaseEstimator, TransformerMixin

# Create a class to select numerical or categorical columns
class PandasDataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values

# ex
feature_list = [...]
("num_features", Pipeline([
    ("select_num_features", PandasDataFrameSelector(feature_list)),
    ("scales", StandardScaler())]))
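Putting the selector to use: a self-contained sketch of a FeatureUnion that scales a numeric column while passing an already-boolean column through untouched (the column names here are made up for illustration):

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import StandardScaler

class PandasDataFrameSelector(BaseEstimator, TransformerMixin):
    """Select a subset of DataFrame columns, returning a numpy array."""
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values

# Hypothetical columns: 'fare' is numeric, 'is_adult' is already 0/1.
df = pd.DataFrame({'fare': [7.25, 71.28, 8.05], 'is_adult': [1, 1, 0]})

full_pipeline = FeatureUnion(transformer_list=[
    ("num_features", Pipeline([
        ("select_num", PandasDataFrameSelector(['fare'])),
        ("scale", StandardScaler())])),
    # the boolean column is only selected; no further preprocessing
    ("bool_features", PandasDataFrameSelector(['is_adult'])),
])

out = full_pipeline.fit_transform(df)
print(out.shape)  # (3, 2): scaled 'fare' plus the untouched flag
```

FeatureUnion concatenates each branch's output horizontally, so the boolean column lands next to the scaled numeric one in the final feature matrix.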
Upvotes: 2