Reputation: 21
I'm trying to fit a Decision Tree model to the UCI Adult dataset. I built the following pipeline to do so:
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

nominal_features = ['workclass', 'education', 'marital-status', 'occupation',
                    'relationship', 'race', 'sex', 'native-country']
nominal_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('ohe', OneHotEncoder(handle_unknown='ignore'))
])

numeric_features = ['age', 'fnlwgt', 'capital-gain', 'capital-loss', 'hours-per-week']
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])

preprocessor = ColumnTransformer(
    transformers=[
        ('numeric', numeric_transformer, numeric_features),
        ('nominal', nominal_transformer, nominal_features)
    ])  # remaining columns will be dropped by default

clf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', DecisionTreeClassifier(criterion='entropy', random_state=0))
])
I then fit my model by calling
clf.fit(X_train, y_train)
Then, when I try to get feature importances,
clf.named_steps['classifier'].feature_importances_
I get an array of shape (104,)
array([1.39312528e-01, 1.92086014e-01, 1.15276068e-01, 4.01797967e-02,
7.08805229e-02, 3.99687904e-03, 6.68727677e-03, 0.00000000e+00,
1.02021005e-02, 5.06637671e-03, 7.97826949e-03, 5.64939616e-03,
0.00000000e+00, 9.09583016e-04, 1.84022196e-03, 9.29047900e-04,
1.74001682e-04, 8.55362503e-05, 2.32440522e-03, 4.65023589e-04,
4.13278579e-03, 3.68265995e-03, 1.78503960e-02, 8.33035943e-03,
6.94454768e-03, 1.75988171e-02, 5.40933687e-04, 7.51299294e-03,
6.07480929e-03, 2.28627732e-03, 1.32219786e-03, 1.92990938e-01,
1.18517448e-03, 1.61377248e-03, 5.72167000e-04, 1.34920904e-03,
5.41685180e-03, 0.00000000e+00, 9.16416279e-03, 1.05824472e-02,
3.07744966e-03, 3.07152204e-03, 5.06657379e-03, 5.21819782e-03,
0.00000000e+00, 7.49534136e-03, 2.83936918e-03, 8.62398812e-03,
5.78720378e-03, 5.37536831e-03, 2.99744077e-03, 1.87247908e-03,
4.87696805e-04, 1.58422357e-03, 2.20761597e-03, 5.57396015e-03,
1.17619435e-03, 1.87465473e-03, 4.08710965e-03, 6.73508851e-04,
6.02887867e-03, 2.38887308e-03, 4.52029746e-03, 7.28018074e-05,
5.13158297e-04, 2.66768058e-04, 0.00000000e+00, 3.28378333e-04,
0.00000000e+00, 8.55362503e-05, 0.00000000e+00, 7.89886262e-04,
1.84475320e-04, 1.37879652e-03, 0.00000000e+00, 3.27800552e-04,
1.95189232e-04, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 9.00792536e-04, 0.00000000e+00, 2.20606426e-04,
5.82787439e-04, 4.85000896e-04, 5.33409400e-04, 0.00000000e+00,
8.75840665e-04, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
4.65546160e-04, 3.37472507e-04, 2.50837357e-04, 2.52474592e-04,
0.00000000e+00, 1.47818105e-04, 3.06829767e-04, 3.73651596e-04,
1.58778645e-04, 4.40566013e-03, 8.55362503e-05, 2.51672361e-04])
which doesn't match my original data, since I only have 13 features. I know the reason for this is the one-hot encoding.
How can I get the actual feature importances?
Upvotes: 2
Views: 1641
Reputation: 4926
Fundamentally, the importance of a data column can be obtained by summing the importances of all the features that are derived from it. Identifying the column-to-feature mapping can be a little tedious to do by hand, but you can always use automated tools for that.
For example, the SkLearn2PMML package can translate Scikit-Learn pipelines to the PMML representation, and perform various analyses and transformations while doing so. The calculation of aggregate feature importances is well supported:
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline

# Use the bare DecisionTreeClassifier here, not the full Pipeline named clf in the question
classifier = DecisionTreeClassifier(criterion='entropy', random_state=0)

pipeline = PMMLPipeline([
    ("preprocessor", preprocessor),
    ("classifier", classifier)
])
pipeline.fit(X, y)

# Re-map the dynamic attribute to a static, pickleable attribute
classifier.pmml_feature_importances_ = classifier.feature_importances_

sklearn2pmml(pipeline, "PipelineWithImportances.pmml.xml")
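Alternatively, the same column-level aggregation can be sketched by hand in plain scikit-learn. This is just a minimal sketch, assuming a recent scikit-learn (>= 1.1, so that get_feature_names_out propagates through the nested Pipeline steps) and reusing the clf, numeric_features and nominal_features objects from the question:
import numpy as np
import pandas as pd

# Names of the transformed features, prefixed by the ColumnTransformer,
# e.g. 'numeric__age' or 'nominal__workclass_Private'
feature_names = clf.named_steps['preprocessor'].get_feature_names_out()
importances = clf.named_steps['classifier'].feature_importances_

column_importances = {}
for col in numeric_features + nominal_features:
    # A transformed feature belongs to `col` if its (unprefixed) name is the
    # column itself, or starts with '<col>_' (a one-hot encoded category)
    mask = np.array([
        name.split('__', 1)[1] == col or name.split('__', 1)[1].startswith(col + '_')
        for name in feature_names
    ])
    column_importances[col] = importances[mask].sum()

print(pd.Series(column_importances).sort_values(ascending=False))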
Upvotes: 0
Reputation: 60370
I am afraid you cannot get importances for your initial features here; your decision tree does not know anything about them. The only features it sees and knows about are the encoded ones.
You may want to try the permutation importance instead, which has several advantages over the tree-based feature importance; it is also easily applicable to pipelines - see Permutation importance using a Pipeline in SciKit-Learn.
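A minimal sketch of how that could look with the pipeline from the question, assuming X_test and y_test are held-out data with the original 13 columns (the names here are illustrative):
import pandas as pd
from sklearn.inspection import permutation_importance

# The whole pipeline is passed in, so each permuted copy of X_test is run
# through the same imputation/encoding/scaling before scoring
result = permutation_importance(clf, X_test, y_test, n_repeats=10,
                                random_state=0, scoring='accuracy')

# One importance value per original column
perm_importances = pd.Series(result.importances_mean, index=X_test.columns)
print(perm_importances.sort_values(ascending=False))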
Upvotes: 2