Ravaal
Ravaal

Reputation: 3359

Which predictive models in sklearn are affected by the order of the columns in the training dataframe?

I'm wondering if any of the estimators that Sci-kit Learn provides is affected by the order of the columns in the dataframe by which it is being trained. I tried establishing a baseline by using ExtraTreesRegressor and it came out to 3 different scores:

Obviously ExtraTreesRegressor is not a good example here, so I tried LinearRegression but it gave .295898 no matter what the order of the columns were.

What I want to know is if there are ANY estimators that are affected by the order of the columns and if there are not then can you point me in the direction of some way, or provide some code, that I can use to make sure that the order of the columns does matter?

Upvotes: 4

Views: 732

Answers (1)

desertnaut
desertnaut

Reputation: 60319

Any algorithm that involves some randomness in selecting features while building the model is expected to be affected from their order; AFAIK, the only cases present in scikit-learn are the Extra Trees and the Random Forest (in both their incarnations as classifiers or regressors), which indeed share some similarities.

The smoking gun for such a behavior is the argument max_features; from the RF docs (the description is identical in the Extra Trees as well):

max_features : {“auto”, “sqrt”, “log2”} int or float, default=”auto”

The number of features to consider when looking for the best split

I am not aware of other algorithms that involve such kind of random feature selection (linear models, decision trees, SVMs, naive Bayes, neural nets, and gradient boosted trees do not), but if you glimpse something similar enough in the documentation, you can bet that the respective algorithm is also affected by the order of the features.

Keep in mind that such slight discrepancies that should not happen in theory are rather to be expected in models where randomness enters from way too many angles. For a similar case with RF in R (slightly different results when asking for importance=TRUE), check my answer in Why does the importance parameter influence performance of Random Forest in R?

Upvotes: 4

Related Questions