Reputation: 6564
I am using sklearn
and mlxtend.regressor.StackingRegressor
to build a stacked regression model.
For example, say I want the following small pipeline:
Unfortunately this is not possible, because StackingRegressor
doesn't accept NaN
in its input data.
This is even if its regressors know how to handle NaN
, as it would be in my case where the regressors are actually pipelines which perform data imputation.
However, this is not a problem: I can just move data imputation outside the stacked regressor. Now my pipeline looks like this:
sklearn.tree.DecisionTreeRegressor
.One might try to implement it as follows (the entire minimal working example in this gist, with comments):
sr_linear = Pipeline(steps=[
('preprocessing', ColumnTransformer(transformers=[
('categorical',
make_pipeline(OneHotEncoder(), StandardScaler()),
make_column_selector(dtype_include='category')),
('numerical',
StandardScaler(),
make_column_selector(dtype_include=np.number))
])),
('model', LinearRegression())
])
sr_tree = DecisionTreeRegressor()
ct_imputation = ColumnTransformer(transformers=[
('categorical',
SimpleImputer(strategy='constant', fill_value='None'),
make_column_selector(dtype_include='category')),
('numerical',
SimpleImputer(strategy='median'),
make_column_selector(dtype_include=np.number))
])
stacked_regressor = Pipeline(steps=[
('imputation', ct_imputation),
('back_to_pandas', FunctionTransformer(
func=lambda values: pd.DataFrame(values, columns=ct_imputation.get_feature_names_out())
)),
('model', StackingRegressor(
regressors=[sr_linear, sr_tree],
meta_regressor=DecisionTreeRegressor(),
use_features_in_secondary=True
))
])
Note that the "outer" ColumnTransformer
(in stacked_regressor
) returns a numpy
matrix.
But the "inner" ColumnTransformer
(in sr_linear
) expects a pandas.DataFrame
, so I had to convert the matrix back to a data frame using step back_to_pandas
.
(To use get_feature_names_out
I had to use the nightly version of sklearn, because the current stable 1.0.2 version does not support it yet. Fortunately it can be installed with one simple command.)
The above code fails when calling stacked_regressor.fit()
, with the following error (the entire stacktrace is again in the gist):
ValueError: make_column_selector can only be applied to pandas dataframes
However, because I added the back_to_pandas
step to my outer pipeline, the inner pipelines should be getting a pandas data frame!
In fact, if I only fit_transform()
my ct_imputation
object, I clearly obtain a pandas data frame.
I cannot understand where and when exactly the data which gets passed around ceases to be a data frame.
Why is my code failing?
Upvotes: 2
Views: 1540
Reputation: 6564
The correct thing to do was:
mlxtend
's to sklearn
's StackingRegressor
. I believe the former was creater when sklearn
still didn't have a stacking regressor. Now there is no need to use more 'obscure' solutions. sklearn
's stacking regressor works pretty well.sklearn
's DecisionTreeRegressor
cannot handle categorical data among the features.A working version of the code is given below:
from sklearn.datasets import fetch_openml
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.ensemble import StackingRegressor
import numpy as np
import pandas as pd
def set_correct_categories(df: pd.DataFrame) -> pd.DataFrame:
for column in df.columns:
if df[column].dtype == object or 'MSSubClass' in column:
df[column] = pd.Categorical(df[column])
return df
d = fetch_openml('house_prices', as_frame=True).frame
d = set_correct_categories(d).drop(columns='Id')
sr_linear = Pipeline(steps=[
('preprocessing', StandardScaler()),
('model', LinearRegression())
])
ct_preprocessing = ColumnTransformer(transformers=[
('categorical',
make_pipeline(
SimpleImputer(strategy='constant', fill_value='None'),
OneHotEncoder(sparse=False, handle_unknown='ignore')
),
make_column_selector(dtype_include='category')),
('numerical',
SimpleImputer(strategy='median'),
make_column_selector(dtype_include=np.number))
])
stacking_regressor = Pipeline(steps=[
('preprocessing', ct_preprocessing),
('model', StackingRegressor(
estimators=[('linear_regression', sr_linear), ('regression_tree', DecisionTreeRegressor())],
final_estimator=DecisionTreeRegressor(),
passthrough=True
))
])
label = 'SalePrice'
features = [col for col in d.columns if col != label]
X, y = d[features], d[label]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True)
stacking_regressor.fit(X_train, y_train)
Thanks to user amiola for his answer putting me on the right track.
Upvotes: 1
Reputation: 3026
Imo the issue has to be ascribed to StackingRegressor
. Actually, I am not an expert on its usage and still I have not explored its source code, but I've found this sklearn issue - #16473 which seems implying that << the concatenation [of regressors and meta_regressors] does not preserve dataframe >> (though this is referred to sklearn
StackingRegressor
instance, rather than on mlxtend
one).
Indeed, have a look at what happens once you replace it with your sr_linear
pipeline:
from sklearn.datasets import fetch_openml
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, FunctionTransformer
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from mlxtend.regressor import StackingRegressor
import numpy as np
import pandas as pd
# We use the Ames house prices dataset for this example
d = fetch_openml('house_prices', as_frame=True).frame
# Small data preprocessing:
for column in d.columns:
if d[column].dtype == object or column == 'MSSubClass':
d[column] = pd.Categorical(d[column])
d.drop(columns='Id', inplace=True)
# Prepare the data for training
label = 'SalePrice'
features = [col for col in d.columns if col != label]
X, y = d[features], d[label]
# Train the stacked regressor
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True)
sr_linear = Pipeline(steps=[
('preprocessing', ColumnTransformer(transformers=[
('categorical',
make_pipeline(OneHotEncoder(), StandardScaler(with_mean=False)),
make_column_selector(dtype_include='category')),
('numerical',
StandardScaler(),
make_column_selector(dtype_include=np.number))
])),
('model', LinearRegression())
])
ct_imputation = ColumnTransformer(transformers=[
('categorical',
SimpleImputer(missing_values=np.nan, strategy='constant', fill_value='None'),
make_column_selector(dtype_include='category')),
('numerical',
SimpleImputer(strategy='median'),
make_column_selector(dtype_include=np.number))
])
stacked_regressor = Pipeline(steps=[
('imputation', ct_imputation),
('back_to_pandas', FunctionTransformer(
func=lambda values: pd.DataFrame(values, columns=ct_imputation.get_feature_names_out()).astype(types)
)),
('mdl', sr_linear)
])
stacked_regressor.fit(X_train, y_train)
Observe that I had to slightly modify the 'back_to_pandas'
step because for some reason pd.DataFrame
was changing the dtypes
of the columns to 'object'
only (from 'category'
and 'float64'
), therefore clashing with the imputation performed in sr_linear
. For this reason, I've applied .astype(types)
to the pd.DataFrame
constructor, where types
is defined as follows (based on the implementation of .get_feature_names_out()
method of the SimpleImputer
from the dev version of sklearn
):
types = {}
for col in d.columns[:-1]:
if d[col].dtype == 'category':
types['categorical__' + col] = str(d[col].dtype)
else:
types['numerical__' + col] = str(d[col].dtype)
Upvotes: 1