Reputation: 21
What I do: I analyse biomarkers from EEG data using several machine learning algorithms and several pre-processing steps. This yields one model for each combination of pre-processing step and algorithm. Each model is trained using StratifiedGroupKFold with 6 folds.
The model from each fold is saved as a .joblib file.
The biomarkers: Each band of the EEG signal has a number of biomarkers. These biomarkers in turn consist of all signals from all electrodes of the EEG. A biomarker therefore consists of several features, which must not be separated (each biomarker must contain all electrode data).
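To make the "must not be separated" constraint concrete, here is a minimal sketch of how the columns could be mapped to biomarker groups. It assumes hypothetical column names of the form `band_biomarker_channel`; the function name `group_columns_by_biomarker`, the separator, and the example column names are illustrative only and not taken from my real data.

```python
from collections import defaultdict


def group_columns_by_biomarker(columns, biomarker_names, bands, sep="_"):
    """Map each band/biomarker pair to all of its electrode columns.

    Assumption: column names look like "<band>_<biomarker>_<channel>".
    Caveat: a biomarker name that is a prefix of another one
    (e.g. "power" and "power_ratio") would need stricter matching.
    """
    groups = defaultdict(list)
    for col in columns:
        for band in bands:
            for name in biomarker_names:
                prefix = f"{band}{sep}{name}{sep}"
                if col.startswith(prefix):
                    groups[f"{band}{sep}{name}"].append(col)
    return dict(groups)


# Hypothetical example columns: 2 bands, 2 biomarkers, 2 electrodes.
columns = [
    "alpha_power_Fz", "alpha_power_Cz",
    "alpha_entropy_Fz", "alpha_entropy_Cz",
    "beta_power_Fz", "beta_power_Cz",
]
groups = group_columns_by_biomarker(columns, ["power", "entropy"], ["alpha", "beta"])
```

Each key of `groups` is then one biomarker, and its value is the full list of electrode columns that have to be kept or dropped together.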
What I would like to do: In my first approach, I trained each model with all biomarkers. I would now like to use feature importance to find out whether I can omit some of them, looking at each pre-processing step and each model separately.
I was recommended SHAP, but my problem is that I don't know how to summarise the channels of each biomarker.
EDIT: I have now managed to summarise the folds with the help of this paper, but I still don't understand how to summarise the channels per biomarker.
New Code:
for r, fold_file in enumerate(fold_files):
    model = joblib.load(fold_file)
    fold_splits = list(sgkf.split(X, y, groups))
    for train_index, test_index in fold_splits:
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]
        explainer = shap.Explainer(model, X_train)
        train_shap_values = explainer(X_train)
        test_shap_values = explainer(X_test)
        for i in range(len(train_index)):
            train_folds_shap_values[train_index[i]] += train_shap_values.values[i] / (len(fold_splits) - 1)
        for i in range(len(test_index)):
            test_folds_shap_values[test_index[i]] += test_shap_values.values[i]
average_train_folds_shap_values = train_folds_shap_values / R
average_test_folds_shap_values = test_folds_shap_values / R
train_shap_df = pd.DataFrame(average_train_folds_shap_values, columns=columns)
test_shap_df = pd.DataFrame(average_test_folds_shap_values, columns=columns)
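To show what the per-fold accumulation above is meant to do, here is a small self-contained sketch with synthetic numbers. The splits and the constant "SHAP values" are stand-ins I made up so the result can be checked by hand; no SHAP explainer or real model is involved.

```python
import numpy as np

n_samples, n_features, n_splits = 6, 2, 3

# Hypothetical splits: every sample is in the test set of exactly one fold.
test_sets = np.array_split(np.arange(n_samples), n_splits)
splits = [(np.setdiff1d(np.arange(n_samples), test), test) for test in test_sets]

train_acc = np.zeros((n_samples, n_features))
test_acc = np.zeros((n_samples, n_features))

for fold, (train_index, test_index) in enumerate(splits):
    # Stand-in for explainer(X_train).values and explainer(X_test).values:
    # a constant equal to the fold number.
    train_vals = np.full((len(train_index), n_features), float(fold))
    test_vals = np.full((len(test_index), n_features), float(fold))

    # Each sample is in the training set of n_splits - 1 folds, so dividing
    # by that count turns the running sum into a mean over those folds.
    train_acc[train_index] += train_vals / (n_splits - 1)
    # Each sample is in the test set exactly once, so no division is needed.
    test_acc[test_index] += test_vals
```

The indexed `+=` replaces the inner `for i in range(...)` loops; it is safe here because the indices within one split are unique.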
I first tried it like this:
grouped_features = group_features(columns, biomarker_names, bands)

def aggregate_shap_values(shap_df, grouped_features):
    aggregated_shap_values = pd.DataFrame()
    for group, features in grouped_features.items():
        aggregated_shap_values[group] = shap_df[features].sum(axis=1)
    return aggregated_shap_values

train_aggregated_shap_df = aggregate_shap_values(train_shap_df, grouped_features)
test_aggregated_shap_df = aggregate_shap_values(test_shap_df, grouped_features)
shap.summary_plot(train_aggregated_shap_df.values, feature_names=train_aggregated_shap_df.columns.tolist())
But the resulting plot just looks wrong: I cannot see a clear distinction in importance between the biomarkers.
Thanks in advance!
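For reference, this is the kind of per-biomarker ranking I am hoping to get. It is a minimal sketch on synthetic data, not my real results: the column names, the group mapping, and the random values are all made up. It relies on SHAP values being additive, so the per-sample sum over a group's columns is that group's joint contribution, and it uses the mean absolute value as the global importance.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for train_shap_df: rows are samples, columns are features.
shap_df = pd.DataFrame({
    "alpha_power_Fz": rng.normal(0, 1.0, 100),
    "alpha_power_Cz": rng.normal(0, 1.0, 100),
    "beta_power_Fz": rng.normal(0, 0.1, 100),
    "beta_power_Cz": rng.normal(0, 0.1, 100),
})
# Hypothetical mapping from biomarker to its electrode columns.
grouped_features = {
    "alpha_power": ["alpha_power_Fz", "alpha_power_Cz"],
    "beta_power": ["beta_power_Fz", "beta_power_Cz"],
}

# Per-sample group contribution: sum of the SHAP values of all channels.
grouped = pd.DataFrame(
    {group: shap_df[cols].sum(axis=1) for group, cols in grouped_features.items()}
)

# Global importance per biomarker: mean absolute grouped SHAP value.
importance = grouped.abs().mean().sort_values(ascending=False)
print(importance)
```

One thing I am unsure about: summing signed values lets channels with opposite signs cancel within a sample, whereas taking the absolute value before summing would measure something different. I also have not yet tried `shap.summary_plot` with `plot_type="bar"` on the aggregated values, which might show the ranking more clearly than the default plot.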
Upvotes: 1
Views: 188