I am using SelectKBest in my pipeline, and I want to configure the number of features to select through a config.ini file. In the .ini file I have this:
    # FeatureSelection: Set the number of features to select
    [FeatureSelection]
    nb_features = 10000 # Number of features to select for Kbest feature selection using chi2 (Integer or all, all will keep all features, therefore perform no feature selection)
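For context, here is a minimal sketch of how such a value could be read with Python's standard `configparser`. The helper name `parse_nb_features` and the variable `k_value` are hypothetical, and the config text is inlined only to keep the example self-contained. It assumes the comment sits on the same line as the value, which is why `inline_comment_prefixes` is set.

```python
import configparser


def parse_nb_features(raw):
    """Return 'all' or a positive int from the raw config string (hypothetical helper)."""
    value = raw.strip()
    if value.lower() == "all":
        return "all"
    k = int(value)  # raises ValueError if the value is not an integer
    if k <= 0:
        raise ValueError(f"nb_features must be positive or 'all', got {k}")
    return k


# Inline comments are not stripped by default, so enable them explicitly.
config = configparser.ConfigParser(inline_comment_prefixes=("#",))
config.read_string(
    """
[FeatureSelection]
nb_features = 10000 # Integer or all
"""
)

k_value = parse_nb_features(config["FeatureSelection"]["nb_features"])
```

With a real file, `config.read("config.ini")` would replace the `read_string` call.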
The problem is that if the input data isn't large enough to yield 10,000 features, SelectKBest raises an error:
ValueError: k should be <= n_features = 4873; got 10000. Use k='all' to return all features.
This is expected, since there aren't enough features available to return 10,000. I have thought of two approaches, but I don't know which one is easier to implement and gives better results:
1. Set k_value dynamically: if it is higher than the number of available features, set it to "all" so that every available feature is kept. Otherwise, keep k_value as set in the config.ini file.
2. Set k_value from a ratio computed during training. Suppose you extract 12,000 features during training (.fit) and keep 10,000; that gives a ratio of 5/6. When predicting on validation data, for instance, you would then keep 5/6 of the available features. So with 6,000 available features, k_value would be set to 5,000 to keep the same proportion of selected features. There should also be a safeguard: if the ratio is greater than 1, k_value should be set to "all", since you can't select more features than were extracted.
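To make the two ideas concrete, here is a rough sketch of how I picture them in plain Python. The function names `resolve_k` and `resolve_k_by_ratio` are hypothetical, and the numbers come from the examples above. This only shows the arithmetic; it does not say which approach is statistically sound.

```python
def resolve_k(requested_k, n_features):
    """Approach 1: fall back to 'all' when the request exceeds what is available."""
    if requested_k == "all" or requested_k >= n_features:
        return "all"
    return requested_k


def resolve_k_by_ratio(kept_at_fit, extracted_at_fit, n_features):
    """Approach 2: keep the same proportion of features as during training."""
    if kept_at_fit >= extracted_at_fit:
        # Ratio >= 1: cannot select more features than were extracted.
        return "all"
    # Round to the nearest integer and keep at least one feature.
    return max(1, round(kept_at_fit * n_features / extracted_at_fit))


k_clamped = resolve_k(10000, 4873)                # not enough features available
k_unchanged = resolve_k(10000, 12000)             # config value can be honoured
k_ratio = resolve_k_by_ratio(10000, 12000, 6000)  # 5/6 of 6000
```

The result would then be passed to the selector, for example `SelectKBest(chi2, k=resolve_k(k_value, X.shape[1]))`, where `X` is the feature matrix given to `.fit`.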
Which would you recommend? I'd really appreciate a hand with this one :D Thanks for your time!