Reputation: 113
Let's say we have a dataset (in .csv format) for supervised machine learning. It has 60 data points (rows of data), and each data point has 100 variables.
Does it make sense to train machine learning models using all 100 variables from only 60 data points? To me, it seems mathematically wrong, like trying to solve a system of equations with 100 variables but only 60 equations.
If a dataset has n variables, what is the minimum number of data points we need to train a machine learning model?
Is there any statistical theory for this?
Thank you very much.
Upvotes: 1
Views: 143
Reputation: 1928
To answer your first question: you are right, it does not make sense to expect a model with 100 features to generalize from only 60 examples.
The statistical reasoning is explained at length in "Statistical Learning Theory" by Vladimir Vapnik. I do not really suggest reading the whole book: it is large, heavy on math, and light on examples. The key concept you need to know is the Vapnik-Chervonenkis dimension, usually called the VC dimension.
Long story short: when the dimension is larger than the number of training examples, what you get is not generalization but overfitting.
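As a quick illustration (a minimal sketch, assuming NumPy and scikit-learn are available), here is what that looks like with exactly your shape of data: 60 examples, 100 features, and a target that is pure noise. An ordinary linear regression fits the training set almost perfectly, yet cross-validation shows it learned nothing.

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)

    # 60 examples, 100 features of pure noise, and a noise target:
    # there is nothing real to learn here.
    X = rng.standard_normal((60, 100))
    y = rng.standard_normal(60)

    model = LinearRegression()
    model.fit(X, y)

    # Training R^2 is essentially perfect, because with more features
    # than examples the model can interpolate the training data.
    print("train R^2:", model.score(X, y))

    # Cross-validated R^2 is very poor (typically negative), showing
    # there is no generalization at all.
    print("cv R^2:", cross_val_score(model, X, y, cv=5).mean())

The same pattern (near-perfect training score, poor cross-validation score) is the practical symptom of the overfitting described above.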
Upvotes: 1