Hahanaki


Dynamically set K value of SelectKBest

I am using SelectKBest in my pipeline and I want to be able to configure the number of features to select through a config.ini file. So essentially in the .ini file I have this:

# FeatureSelection: Set the number of features to select
[FeatureSelection]
nb_features = 10000  # Number of features to select for KBest feature selection using chi2 (integer or "all"; "all" keeps all features, i.e. performs no feature selection)
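For reference, one way to read that value (the section and key names are taken from the snippet above; the parsing approach is just one possibility, not something from the original post) is with Python's built-in configparser, treating the value as either an integer or the literal string "all":

```python
import configparser

# inline_comment_prefixes lets us keep the trailing "# ..." comment in the .ini
config = configparser.ConfigParser(inline_comment_prefixes=("#",))
config.read_string(
    "[FeatureSelection]\n"
    "nb_features = 10000  # Number of features to select\n"
)

raw = config["FeatureSelection"]["nb_features"].strip()
# nb_features is either an integer or the string "all"
k = raw if raw == "all" else int(raw)
print(k)  # 10000
```

In a real pipeline you would call `config.read("config.ini")` instead of `read_string`.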

The problem is that if the input data isn't large enough to yield 10 000 features, SelectKBest raises an error:

ValueError: k should be <= n_features = 4873; got 10000. Use k='all' to return all features.

This is expected, since there aren't enough features to return 10 000 of them. I was thinking about two approaches, but I don't know which one is easier to implement and more accurate:

  1. Set the k_value dynamically: if it is higher than the number of available features, set it to "all" so that the maximum number of features is selected. Otherwise, keep the k_value set in the config.ini file.

  2. Set the k_value based on a ratio computed during training. Imagine you extract 12 000 features during training (.fit) and keep 10 000; that gives a ratio of 5/6. During prediction on validation data, for instance, you would then keep 5/6 of the available features: with 6 000 features available, k_value would be set to 5 000 to preserve the same proportion of selected features. There should also be a safeguard: if the ratio is >= 1, k_value should be set to "all", since you can't select more features than were extracted.
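Both approaches above can be sketched in one helper (the function name `effective_k` and its signature are illustrative, not part of sklearn; the numbers reuse the examples from the two options):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2


def effective_k(k_config, n_features, train_ratio=None):
    """Pick a valid k for SelectKBest.

    Approach 1 (clamp): if the configured k meets or exceeds the number
    of available features, fall back to "all".
    Approach 2 (ratio): if a ratio from training is given, scale it by
    the available feature count, capping at "all" when the ratio >= 1.
    """
    if train_ratio is not None:  # approach 2
        if train_ratio >= 1:
            return "all"
        return max(1, int(train_ratio * n_features))
    if k_config == "all" or k_config >= n_features:  # approach 1
        return "all"
    return k_config


# Approach 1: configured k larger than the 4873 available features
print(effective_k(10000, 4873))   # 'all'
print(effective_k(10000, 12000))  # 10000

# Approach 2: keep the 10000/12000 = 5/6 ratio from training
print(effective_k(None, 6000, train_ratio=10000 / 12000))  # 5000

# Plugging the helper into SelectKBest on toy data
X, y = make_classification(n_samples=50, n_features=20, random_state=0)
X = abs(X)  # chi2 requires non-negative feature values
selector = SelectKBest(chi2, k=effective_k(10, X.shape[1]))
X_new = selector.fit_transform(X, y)
print(X_new.shape)  # (50, 10)
```

Note that SelectKBest accepts the string "all" for k, so the clamp in approach 1 maps cleanly onto the existing API.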

What would you recommend? I'd really appreciate it if someone could lend me a hand on this one :D Thanks for your time!
