chesslad

Reputation: 21

How do I get feature importances for a decision tree pipeline that has preprocessing and classification steps?

I'm trying to fit a decision tree model on the UCI Adult dataset. I built the following pipeline to do so:

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

nominal_features = ['workclass', 'education', 'marital-status', 'occupation',
                    'relationship', 'race', 'sex', 'native-country']

nominal_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('ohe', OneHotEncoder(handle_unknown='ignore'))
])

numeric_features = ['age', 'fnlwgt', 'capital-gain', 'capital-loss', 'hours-per-week']

numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])

preprocessor = ColumnTransformer(
    transformers=[
        ('numeric', numeric_transformer, numeric_features),
        ('nominal', nominal_transformer, nominal_features)
    ]) # remaining columns will be dropped by default

clf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', DecisionTreeClassifier(criterion='entropy', random_state=0))
])

I then fit my model by calling

clf.fit(X_train, y_train)

Then, when I try to get feature importances,

clf.named_steps['classifier'].feature_importances_

I get an array of shape (104,):

array([1.39312528e-01, 1.92086014e-01, 1.15276068e-01, 4.01797967e-02,
       7.08805229e-02, 3.99687904e-03, 6.68727677e-03, 0.00000000e+00,
       1.02021005e-02, 5.06637671e-03, 7.97826949e-03, 5.64939616e-03,
       0.00000000e+00, 9.09583016e-04, 1.84022196e-03, 9.29047900e-04,
       1.74001682e-04, 8.55362503e-05, 2.32440522e-03, 4.65023589e-04,
       4.13278579e-03, 3.68265995e-03, 1.78503960e-02, 8.33035943e-03,
       6.94454768e-03, 1.75988171e-02, 5.40933687e-04, 7.51299294e-03,
       6.07480929e-03, 2.28627732e-03, 1.32219786e-03, 1.92990938e-01,
       1.18517448e-03, 1.61377248e-03, 5.72167000e-04, 1.34920904e-03,
       5.41685180e-03, 0.00000000e+00, 9.16416279e-03, 1.05824472e-02,
       3.07744966e-03, 3.07152204e-03, 5.06657379e-03, 5.21819782e-03,
       0.00000000e+00, 7.49534136e-03, 2.83936918e-03, 8.62398812e-03,
       5.78720378e-03, 5.37536831e-03, 2.99744077e-03, 1.87247908e-03,
       4.87696805e-04, 1.58422357e-03, 2.20761597e-03, 5.57396015e-03,
       1.17619435e-03, 1.87465473e-03, 4.08710965e-03, 6.73508851e-04,
       6.02887867e-03, 2.38887308e-03, 4.52029746e-03, 7.28018074e-05,
       5.13158297e-04, 2.66768058e-04, 0.00000000e+00, 3.28378333e-04,
       0.00000000e+00, 8.55362503e-05, 0.00000000e+00, 7.89886262e-04,
       1.84475320e-04, 1.37879652e-03, 0.00000000e+00, 3.27800552e-04,
       1.95189232e-04, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 9.00792536e-04, 0.00000000e+00, 2.20606426e-04,
       5.82787439e-04, 4.85000896e-04, 5.33409400e-04, 0.00000000e+00,
       8.75840665e-04, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       4.65546160e-04, 3.37472507e-04, 2.50837357e-04, 2.52474592e-04,
       0.00000000e+00, 1.47818105e-04, 3.06829767e-04, 3.73651596e-04,
       1.58778645e-04, 4.40566013e-03, 8.55362503e-05, 2.51672361e-04])

which is not what I expected, as I only have 13 features. I know the reason for this is the OneHotEncoder: it expands the 8 nominal columns into 99 binary indicator features, which together with the 5 numeric features makes 104.

How can I get the actual feature importances?

Upvotes: 2

Views: 1641

Answers (2)

user1808924

Reputation: 4926

Fundamentally, the importance of a data column can be obtained by summing the importances of all the features that are derived from it. Identifying column-to-feature mappings can be tedious to do by hand, but you can always use automated tools for that.
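
If you want to do the summation by hand, here is a minimal sketch, assuming a recent scikit-learn (>= 1.1, where both Pipeline and ColumnTransformer expose get_feature_names_out) and the clf, numeric_features and nominal_features objects from the question:

import pandas as pd

# Encoded names look like 'numeric__age' or 'nominal__workclass_Private'
feature_names = clf.named_steps['preprocessor'].get_feature_names_out()
importances = clf.named_steps['classifier'].feature_importances_

original_columns = numeric_features + nominal_features

agg = pd.Series(0.0, index=original_columns)
for name, imp in zip(feature_names, importances):
    # Strip the ColumnTransformer prefix ('numeric__' / 'nominal__')
    stripped = name.split('__', 1)[1]
    # Credit the longest original column whose name prefixes the encoded name
    col = max((c for c in original_columns if stripped.startswith(c)), key=len)
    agg[col] += imp

print(agg.sort_values(ascending=False))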

If you would rather use an automated tool, the SkLearn2PMML package can translate Scikit-Learn pipelines to the PMML representation, and perform various analyses and transformations while doing so. The calculation of aggregate feature importances is well supported:

from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline

# Use a standalone classifier object here; the question's `clf` is the
# whole Pipeline, which has no feature_importances_ attribute of its own
classifier = DecisionTreeClassifier(criterion='entropy', random_state=0)

pipeline = PMMLPipeline([
  ("preprocessor", preprocessor),
  ("classifier", classifier)
])
pipeline.fit(X, y)
# Re-map the dynamic attribute to a static, pickleable attribute
classifier.pmml_feature_importances_ = classifier.feature_importances_

sklearn2pmml(pipeline, "PipelineWithImportances.pmml.xml")

Upvotes: 0

desertnaut

Reputation: 60370

I am afraid you cannot get importances for your initial features here. Your decision tree does not know anything about them; the only features it ever sees and splits on are the encoded ones.

You may want to try permutation importance instead, which has several advantages over the impurity-based tree feature importance; it is also easily applicable to pipelines - see Permutation importance using a Pipeline in SciKit-Learn.
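
For instance, a minimal sketch with sklearn.inspection.permutation_importance, where X_test and y_test are a hypothetical hold-out split (not something from the question):

from sklearn.inspection import permutation_importance

# Each original column of the hold-out frame is shuffled in turn; the
# pipeline re-applies the preprocessing on every permutation, so the
# scores refer to the 13 raw features, not the 104 encoded ones.
result = permutation_importance(clf, X_test, y_test,
                                n_repeats=10, random_state=0)

for col, mean, std in zip(X_test.columns,
                          result.importances_mean,
                          result.importances_std):
    print(f'{col}: {mean:.4f} +/- {std:.4f}')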

Upvotes: 2
