Zachai
How to use SHAP Values for grouped Feature Importance?

What I do: I analyse different biomarkers from EEG data with the help of different machine learning algorithms and different pre-processing steps etc. This results in several models for each combination of pre-processing step and algorithm. Each model is trained using StratifiedGroupKFold with a total of 6 folds.

The model of each fold is saved as a .joblib file.

The biomarkers: Each band of the EEG signal has a number of biomarkers. These biomarkers in turn consist of all signals from all electrodes of the EEG. A biomarker therefore consists of several features, which must not be separated (each biomarker must contain all electrode data).
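
To make the grouping concrete, here is a minimal sketch of how feature columns could be mapped to biomarker groups. It assumes a hypothetical column naming scheme of the form `<band>_<biomarker>_<channel>` (for example `alpha_power_Fz`); the real column names are not shown in this post, so the parsing rule and the function name `group_columns_by_biomarker` are illustrations only.

```python
from collections import defaultdict


def group_columns_by_biomarker(columns, sep="_"):
    """Map each band/biomarker pair to the list of its per-channel columns.

    Assumes hypothetical names of the form '<band>_<biomarker>_<channel>'.
    """
    groups = defaultdict(list)
    for col in columns:
        # Split into at most three parts: band, biomarker, channel.
        band, biomarker, _channel = col.split(sep, 2)
        groups[f"{band}{sep}{biomarker}"].append(col)
    return dict(groups)


# Hypothetical example: three channels, two bands, two biomarkers.
example_columns = [
    "alpha_power_Fz", "alpha_power_Cz", "alpha_power_Pz",
    "beta_power_Fz", "beta_power_Cz", "beta_power_Pz",
    "alpha_entropy_Fz", "alpha_entropy_Cz", "alpha_entropy_Pz",
]
example_groups = group_columns_by_biomarker(example_columns)
```

Each value in `example_groups` is the full set of channel columns that must stay together for one biomarker.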

What I would like to do: In my first approach, I trained each model with all biomarkers. I would now like to use feature importance to find out whether I can omit some of them. To do this, I would like to look at each pre-processing step and each model separately.

SHAP was recommended to me, but my problem is that I don't know how to aggregate the channels of each biomarker into a single importance value.

EDIT: I finally managed to aggregate the SHAP values across the folds, with the help of this paper. But I still don't understand how to aggregate the channels per biomarker.

New Code:

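import joblib
import pandas as pd
import shap

# X, y, groups, sgkf, fold_files, columns, R and the accumulators
# train_folds_shap_values / test_folds_shap_values are defined earlier (not shown).
for r, fold_file in enumerate(fold_files):
    model = joblib.load(fold_file)

    fold_splits = list(sgkf.split(X, y, groups))

    for train_index, test_index in fold_splits:
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]

        # Explain the loaded model, using the training data as background.
        explainer = shap.Explainer(model, X_train)
        train_shap_values = explainer(X_train)
        test_shap_values = explainer(X_test)

        # A sample is in the training set of (n_splits - 1) folds, so average over those.
        for i in range(len(train_index)):
            train_folds_shap_values[train_index[i]] += train_shap_values.values[i] / (len(fold_splits) - 1)
        # A sample is in the test set of exactly one fold, so no division is needed.
        for i in range(len(test_index)):
            test_folds_shap_values[test_index[i]] += test_shap_values.values[i]

# Average over the outer loop.
average_train_folds_shap_values = train_folds_shap_values / R
average_test_folds_shap_values = test_folds_shap_values / R

train_shap_df = pd.DataFrame(average_train_folds_shap_values, columns=columns)
test_shap_df = pd.DataFrame(average_test_folds_shap_values, columns=columns)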
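
The averaging above relies on each sample appearing in the test set of exactly one fold and in the training set of the remaining k - 1 folds. A minimal sketch of that bookkeeping follows. It uses random placeholder arrays instead of real SHAP values and a plain index split built with NumPy instead of `StratifiedGroupKFold`; both are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, n_folds = 12, 4, 6

# Hypothetical stand-in for the cross-validation splits: every sample is in
# exactly one test fold and in the training part of all other folds.
test_folds = np.array_split(rng.permutation(n_samples), n_folds)
splits = [(np.setdiff1d(np.arange(n_samples), test_idx), test_idx)
          for test_idx in test_folds]

train_sum = np.zeros((n_samples, n_features))
test_sum = np.zeros((n_samples, n_features))
train_count = np.zeros(n_samples)
test_count = np.zeros(n_samples)

for train_idx, test_idx in splits:
    # Placeholders for explainer(X_train).values and explainer(X_test).values.
    fake_train_shap = rng.normal(size=(len(train_idx), n_features))
    fake_test_shap = rng.normal(size=(len(test_idx), n_features))

    # Indices within one split are unique, so in-place addition is safe.
    train_sum[train_idx] += fake_train_shap
    test_sum[test_idx] += fake_test_shap
    train_count[train_idx] += 1
    test_count[test_idx] += 1

# Dividing by the per-sample counts gives the per-sample mean over folds.
train_mean = train_sum / train_count[:, None]
test_mean = test_sum / test_count[:, None]
```

Counting occurrences explicitly avoids hard-coding the divisor, and the counts can be checked against the expected values of k - 1 (training) and 1 (test).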

I first tried it like this:

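# group_features is my own helper (not shown); it returns a dict that maps
# each group name to the list of column names belonging to that group.
grouped_features = group_features(columns, biomarker_names, bands)

def aggregate_shap_values(shap_df, grouped_features):
    """Sum the per-sample SHAP values of all columns in each group."""
    aggregated_shap_values = pd.DataFrame()
    for group, features in grouped_features.items():
        aggregated_shap_values[group] = shap_df[features].sum(axis=1)
    return aggregated_shap_values

train_aggregated_shap_df = aggregate_shap_values(train_shap_df, grouped_features)
test_aggregated_shap_df = aggregate_shap_values(test_shap_df, grouped_features)

# No feature values are passed here, only the aggregated SHAP values and the group names.
shap.summary_plot(train_aggregated_shap_df.values, feature_names=train_aggregated_shap_df.columns.tolist())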

But the result just looks... wrong. I can no longer see a clear difference in importance between the groups.

Grouped: [image: SHAP summary plot]

Not grouped: [image: SHAP summary plot]
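
For comparing groups numerically rather than visually, one option is to rank them by the mean absolute value of the aggregated SHAP values. The sketch below runs on random placeholder data; the column names and the `groups` mapping are hypothetical. Summing signed SHAP values within a group is often used as the group's joint contribution because SHAP values are additive, but contributions of opposite sign within a group cancel in that sum, so the sketch also computes the sum of absolute values for comparison.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical per-sample SHAP values: 100 samples, two biomarkers x three channels.
columns = ["alpha_power_Fz", "alpha_power_Cz", "alpha_power_Pz",
           "beta_power_Fz", "beta_power_Cz", "beta_power_Pz"]
shap_df = pd.DataFrame(rng.normal(size=(100, len(columns))), columns=columns)

# Hypothetical mapping from biomarker group to its channel columns.
groups = {
    "alpha_power": columns[:3],
    "beta_power": columns[3:],
}

# Variant 1: signed sum per sample, then mean absolute value over samples.
# Opposite-sign channel contributions can cancel each other here.
summed = pd.DataFrame({g: shap_df[cols].sum(axis=1) for g, cols in groups.items()})
importance_signed_sum = summed.abs().mean().sort_values(ascending=False)

# Variant 2: sum of absolute values per sample, then mean over samples.
# No cancellation, but larger groups accumulate larger totals.
abs_summed = pd.DataFrame({g: shap_df[cols].abs().sum(axis=1) for g, cols in groups.items()})
importance_abs_sum = abs_summed.mean().sort_values(ascending=False)

print(importance_signed_sum)
print(importance_abs_sum)
```

A large gap between the two variants for a group would suggest that its channels push the prediction in opposite directions, which could make the grouped plot look flatter than the ungrouped one.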

Thanks in advance!
