Lenoir

Reputation: 113

supervised machine learning: relationship between number of data points and variables

Say we have a dataset (in .csv format) for supervised machine learning. It has 60 data points (rows of data), and each data point has 100 variables.

Does it make sense to train machine learning models using all 100 variables from only 60 data points? To me, it seems mathematically wrong, like trying to solve a system of equations with 100 variables but only 60 equations.

In a dataset with n variables, what is the minimum number of data points we need to train a machine learning model?

Is there any statistical theory for this?

Thank you very much.

Upvotes: 1

Views: 143

Answers (1)

alift

Reputation: 1928

To answer your first question: you are right, it does not make sense to try to generalize a model with 100 features from only 60 examples.

The statistical reasoning is explained at length in Vladimir Vapnik's work on statistical learning theory. I do not really suggest reading the whole book; it is long, heavy on math, and light on examples. The key concept you need is the Vapnik-Chervonenkis dimension, usually just called the VC dimension.

Long story short: when the dimension is larger than the number of training examples, what you get is not generalization but overfitting.
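You can see this directly with a small sketch (my own illustration, not from any particular textbook): with 60 examples and 100 features, ordinary least squares can fit *pure noise* perfectly on the training set, because the linear system is underdetermined, yet it predicts nothing on fresh data. The shapes (60 x 100) match the question; the random data is just for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_samples, n_features = 60, 100          # fewer examples than variables
X = rng.normal(size=(n_samples, n_features))
y = rng.normal(size=n_samples)           # pure noise: nothing to learn

# Ordinary least squares: with 100 unknowns and only 60 equations the
# system is underdetermined, so some weight vector fits the training
# data exactly (lstsq returns the minimum-norm such solution).
w, *_ = np.linalg.lstsq(X, y, rcond=None)
train_error = np.mean((X @ w - y) ** 2)

# A fresh sample from the same (noise) distribution is not fit at all.
X_test = rng.normal(size=(n_samples, n_features))
y_test = rng.normal(size=n_samples)
test_error = np.mean((X_test @ w - y_test) ** 2)

print(f"train MSE: {train_error:.2e}")   # essentially zero: perfect fit
print(f"test MSE:  {test_error:.2f}")    # large: the fit was memorization
```

The near-zero training error despite the labels being random noise is exactly the overfitting the VC-dimension argument predicts.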

Upvotes: 1
