I have a classification problem and my current feature vector does not seem to hold enough information. My training set has 10k entries and I am using an SVM as the classifier (scikit-learn).
What is the maximum reasonable feature vector size (how many dimensions)?
(Training and evaluation run on a laptop CPU.)
100? 1k? 10k? 100k? 1M?
The question is not how many features you should have for a certain number of cases (i.e. entries), but rather the opposite:
It’s not who has the best algorithm that wins. It’s who has the most data. (Banko and Brill, 2001)
In 2001, Banko and Brill compared four different algorithms while increasing the training-set size into the millions, and reached the conclusion quoted above.
Moreover, Prof. Andrew Ng clearly covered this topic, and I’m quoting here:
If a learning algorithm is suffering from high variance, getting more training data is likely to help.
If a learning algorithm is suffering from high bias, getting more training data will not (by itself) help much.
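One practical way to tell which regime you are in is to plot a learning curve. Below is a minimal sketch using scikit-learn's learning_curve; the synthetic dataset and the RBF SVM are placeholders, not the asker's actual data or model:

```python
# Sketch: use a learning curve to tell high variance from high bias.
# The synthetic dataset and the RBF-SVM below are placeholders.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

train_sizes, train_scores, val_scores = learning_curve(
    SVC(kernel="rbf"), X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 5))

train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)

# High variance: a large train/validation gap that narrows as the
# training size grows -> collecting more samples is likely to help.
# High bias: both curves plateau close together at a low score ->
# more data will not help much; add informative features instead.
gap = train_mean[-1] - val_mean[-1]
print(f"train={train_mean[-1]:.2f}  val={val_mean[-1]:.2f}  gap={gap:.2f}")
```

If the gap stays large as the training size grows, more data is the first thing to try; if both curves flatten out close together, look at the features instead.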
So, as a rule of thumb, the number of data cases must be greater than the number of features in your dataset, and all features should be as informative as possible (i.e. not highly collinear, in other words not redundant).
I have read in more than one place, including somewhere in the Scikit-Learn documentation, that the number of inputs (i.e. samples) should be at least the square of the number of features (i.e. n_samples > n_features ** 2).
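Applied to the question's 10k-entry training set, that heuristic bounds the feature count directly; the arithmetic below is just the rule worked out, not a hard limit:

```python
import math

n_samples = 10_000  # the asker's training-set size

# The heuristic requires n_samples > n_features ** 2 (strict), so the
# largest admissible n_features is the integer square root of n_samples - 1.
max_features = math.isqrt(n_samples - 1)
print(max_features)  # -> 99
```

So under this rule of thumb, 10k samples would support on the order of 100 features at most.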
Nevertheless, for SVM in particular, the number of features n vs. the number of entries m is an important factor in choosing which type of kernel to use initially. As a second rule of thumb for SVM in particular (also according to Prof. Andrew Ng):
If n is large (up to 10K) and m is small (up to 1K) --> use SVM without a kernel (i.e. "linear kernel") or use Logistic Regression.
If n is small (up to 1K) and m is intermediate (up to 10K) --> use SVM with a Gaussian kernel.
If n is small (up to 1K) and m is large (> 50K) --> create/add more features, then use SVM without a kernel or use Logistic Regression.