Missak Boyajian

Reputation: 2245

Machine Learning - Feature Ranking by Algorithms

I have a dataset that contains around 30 features and I want to find out which features contribute the most to the outcome. I have 5 algorithms:

  1. Neural Networks
  2. Logistic Regression
  3. Naive Bayes
  4. Random Forest
  5. Adaboost

I have read a lot about the Information Gain technique, and it seems to be independent of the machine learning algorithm used; it works like a preprocessing step.

My question is: is it best practice to compute feature importance separately for each algorithm, or just to use Information Gain? If the former, what techniques are used for each algorithm?

Upvotes: 4

Views: 1013

Answers (2)

Florian Mutel

Reputation: 1084

Since your purpose is to get some intuition on what's going on, here is what you can do:

Let's start with Random Forest for simplicity, but you can do this with other algorithms too. First, you need to build a good model: good in the sense that you are satisfied with its performance, and robust, meaning that you use a validation and/or a test set. These points are very important because we will analyse how the model makes its decisions; if the model is bad, you will get bad intuitions.
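
As a hedged sketch, here is roughly what that first step could look like with scikit-learn; the synthetic data, `X`, and `y` are placeholders for your own dataset:

```python
# Build and validate a Random Forest before trying to interpret it.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Placeholder data: swap in your own ~30-feature dataset here.
X, y = make_classification(n_samples=1000, n_features=30, random_state=0)

# Hold out a validation set so the performance estimate is honest.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# Only move on to interpretation if this number satisfies you.
print("validation accuracy:", accuracy_score(y_val, model.predict(X_val)))
```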

After having built the model, you can analyse it at two levels: for the whole dataset (understanding your process), or for a given prediction. For this task I suggest you look at the SHAP library, which computes feature contributions (i.e. how much a feature influences the prediction of the classifier) and can be used for both purposes.
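
A minimal sketch of the SHAP workflow, reusing the `model` and `X_val` names from the snippet above and assuming the `shap` package is installed (`pip install shap`):

```python
import shap

# TreeExplainer is tailored to tree ensembles such as Random Forests.
explainer = shap.TreeExplainer(model)

# One contribution per feature per sample; for a binary classifier this
# may be a list with one array per class.
shap_values = explainer.shap_values(X_val)

# Global view: which features drive predictions across the validation set.
shap.summary_plot(shap_values, X_val)
```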

For detailed instructions about this process and more tools, you can look at the excellent fast.ai machine learning course series, where lessons 2/3/4/5 are about this subject.

Hope it helps!

Upvotes: 1

appletree

Reputation: 86

First of all, it's worth stressing that you have to perform feature selection based on the training data only, even if the selection method is a separate algorithm from the classifier. At test time, you then select the same features from the test dataset.
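
As a concrete sketch with scikit-learn (the synthetic data and `k=10` are arbitrary placeholders), note that the selector is fitted on the training split only:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=30, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

selector = SelectKBest(mutual_info_classif, k=10)
X_train_sel = selector.fit_transform(X_train, y_train)  # scores learned on train only
X_test_sel = selector.transform(X_test)                 # same features reused on test
```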

Some approaches that spring to mind:

  1. Mutual information based feature selection (eg here), independent of the classifier.
  2. Backward or forward selection (see stackexchange question), applicable to any classifier but potentially costly since you need to train/test many models.
  3. Regularisation techniques that are part of the classifier optimisation, eg Lasso or elastic net (a sketch follows after this list). The latter can be better in datasets with high collinearity.
  4. Principal components analysis or any other dimensionality reduction technique that groups your features (example).
  5. Some models compute latent variables which you can use for interpretation instead of the original features (e.g. Partial Least Squares or Canonical Correlation Analysis).
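
To make approach 3 concrete, a hedged sketch using an L1-penalised logistic regression; the synthetic data and the regularisation strength `C=0.1` are arbitrary:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=30, n_informative=5,
                           random_state=0)

# The L1 penalty drives coefficients of uninformative features to exactly zero.
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
lasso.fit(X, y)

# Features with non-zero coefficients survived the regularisation.
selected = [i for i, c in enumerate(lasso.coef_[0]) if c != 0]
print("selected features:", selected)
```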

Specific classifiers can aid interpretability by providing extra information about the features/predictors, off the top of my head:

  • Logistic regression: you can obtain a p-value for every feature. In your interpretation you can focus on those that are 'significant' (eg p-value <0.05). (The same holds for two-class Linear Discriminant Analysis.)
  • Random Forest: can return a variable importance index that ranks the variables from most to least important (a short sketch of both follows below).
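
A short sketch of both ideas; the p-values come from statsmodels (scikit-learn's LogisticRegression does not report them), and the synthetic data is a placeholder:

```python
import numpy as np
import statsmodels.api as sm
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=30, random_state=0)

# Logistic regression: one p-value per feature (plus the intercept).
logit = sm.Logit(y, sm.add_constant(X)).fit(disp=0)
print(logit.pvalues)

# Random Forest: impurity-based importances, most to least important.
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
print(np.argsort(rf.feature_importances_)[::-1])
```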

> I have a dataset that contains around 30 features and I want to find out which features contribute the most to the outcome.

This will depend on the algorithm. If you have 5 algorithms, you will likely get 5 slightly different answers, unless you perform the feature selection prior to classification (eg using mutual information). One reason is that Random Forests and neural networks would pick up nonlinear relationships while logistic regression wouldn't. Furthermore, Naive Bayes is blind to interactions. So unless your research is explicitly about these 5 models, I would rather select one model and proceed with it.
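
To illustrate, a small sketch that ranks the same synthetic features with two of your five models; expect the orderings to disagree to some degree:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=30, random_state=0)

# Rank features by Random Forest importance and by |logistic coefficient|.
rf_rank = np.argsort(
    RandomForestClassifier(random_state=0).fit(X, y).feature_importances_)[::-1]
lr_rank = np.argsort(
    np.abs(LogisticRegression(max_iter=1000).fit(X, y).coef_[0]))[::-1]

print("Random Forest top 5:       ", rf_rank[:5])
print("Logistic regression top 5: ", lr_rank[:5])
```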

Upvotes: 5
