lserlohn

Reputation: 6216

Does it help to duplicate the original data in order to make more data for building a model?

I just got an interview question.

"Assume you want to build a statistical or machine learning model, but you have very limited data on hand. Your boss tells you that you can duplicate the original data several times to make more data for building the model." Does it help?

Intuitively, it does not help, because duplicating original data doesn't create more "information" to feed the model.

But can anyone explain it more statistically? Thanks

Upvotes: 0

Views: 269

Answers (2)

Robert Dodier

Reputation: 17605

Well, it depends on exactly what one means by "duplicating the data".

If one is exactly duplicating the whole data set a number of times, then methods based on maximum likelihood (as with many models in common use) must find exactly the same result, since the log likelihood function of the duplicated data is exactly a multiple of the unduplicated data's log likelihood, and therefore has the same maxima. (This argument doesn't apply to methods which aren't based on the likelihood function; I believe that CART and other tree models, and SVMs, are such models. In that case you'll have to work out a different argument.)
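To make the likelihood argument concrete, here is a minimal sketch assuming an i.i.d. Gaussian model, which the answer does not specify. The sample values, the copy count `k`, and the helper names `gaussian_log_likelihood` and `gaussian_mle` are all made up for illustration. It checks that the log likelihood of `k` copies is `k` times the original and that the maximum likelihood estimates do not move.

```python
import math

def gaussian_log_likelihood(data, mu, sigma):
    """Log likelihood of i.i.d. Gaussian data at parameters (mu, sigma)."""
    n = len(data)
    squared_error = sum((x - mu) ** 2 for x in data)
    return -n * math.log(sigma * math.sqrt(2 * math.pi)) - squared_error / (2 * sigma ** 2)

def gaussian_mle(data):
    """Closed-form maximum likelihood estimates: mean and 1/n standard deviation."""
    n = len(data)
    mu = sum(data) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in data) / n)
    return mu, sigma

data = [2.1, 3.4, 1.9, 4.2, 2.8]  # made-up sample, for illustration only
k = 3                             # number of copies of the whole data set
duplicated = data * k

mu, sigma = gaussian_mle(data)
mu_dup, sigma_dup = gaussian_mle(duplicated)

# Evaluate both log likelihoods at the same parameter values.
ll = gaussian_log_likelihood(data, mu, sigma)
ll_dup = gaussian_log_likelihood(duplicated, mu, sigma)

print(mu, mu_dup)        # same estimate of the mean
print(sigma, sigma_dup)  # same estimate of the standard deviation
print(ll_dup, k * ll)    # duplicated log likelihood is k times the original
```

Scaling a function by a positive constant does not change where its maximum is, which is why the fitted parameters come out the same.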

However, if by duplicating, one means duplicating the positive examples in a classification problem (which is common enough, since there are often many more negative examples than positive), then that does make a difference, since the likelihood function is modified.
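A small sketch of why duplicating only one class changes the fit, assuming the simplest possible classifier: an intercept-only Bernoulli model, whose maximum likelihood estimate is just the fraction of positives. The labels and the helper `bernoulli_mle` are hypothetical, chosen only to illustrate the point.

```python
def bernoulli_mle(labels):
    """MLE of P(y = 1) under an intercept-only Bernoulli model."""
    return sum(labels) / len(labels)

labels = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # made-up labels: 1 positive, 9 negatives

# Case 1: duplicate the whole data set three times.
whole_duplicated = labels * 3

# Case 2: duplicate only the positive examples until the classes balance.
positives = [y for y in labels if y == 1]
oversampled = labels + positives * 8      # now 9 positives and 9 negatives

p_original = bernoulli_mle(labels)
p_whole = bernoulli_mle(whole_duplicated)
p_oversampled = bernoulli_mle(oversampled)

print(p_original)     # fraction of positives in the original data
print(p_whole)        # unchanged by duplicating everything
print(p_oversampled)  # shifted by duplicating only the positives
```

Duplicating everything leaves the estimate alone, while oversampling the positives reweights the likelihood toward that class and moves the estimate.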

Also if one means bootstrapping, then that, too, makes a difference.

PS. Probably you'll get more interest in this question on stats.stackexchange.com.

Upvotes: 0

Has QUIT--Anony-Mousse

Reputation: 77505

Consider, e.g., variance. The data set with the duplicated data will have the exact same variance - you don't have a more precise estimate of the distribution afterwards.
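A quick check of the variance claim, as a sketch on made-up numbers. It uses the population variance (`statistics.pvariance`, the 1/n form); that choice is my assumption about what is meant, since the n-1 sample variance changes slightly when n grows. The naive standard error of the mean is included to show where duplication can mislead.

```python
import math
import statistics

data = [2.1, 3.4, 1.9, 4.2, 2.8]  # made-up sample, for illustration only
k = 4                             # number of copies of the whole data set
duplicated = data * k

var_original = statistics.pvariance(data)
var_duplicated = statistics.pvariance(duplicated)

# Naive standard error of the mean, sigma / sqrt(n), treats every row as an
# independent observation, which the duplicated rows are not.
se_original = math.sqrt(var_original / len(data))
se_duplicated = math.sqrt(var_duplicated / len(duplicated))

print(var_original, var_duplicated)  # same spread
print(se_original, se_duplicated)    # naive SE shrinks by a factor of sqrt(k)
```

The smaller standard error on the duplicated data is spurious: the copies carry no new information, so the formula's independence assumption is violated and the apparent gain in precision is not real.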

There are, however, some exceptions. For example, bootstrap validation helps when you are evaluating your model but have very little data.
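For reference, one common form of bootstrap validation is out-of-bag evaluation; the sketch below assumes that variant, which may not be exactly what is meant above. The "model" is a deliberately trivial stand-in that predicts the training mean, and `bootstrap_oob_mse`, its parameters, and the data are all hypothetical.

```python
import random

def bootstrap_oob_mse(data, n_rounds=200, seed=0):
    """Average out-of-bag squared error of a mean predictor over bootstrap rounds."""
    rng = random.Random(seed)
    n = len(data)
    scores = []
    for _ in range(n_rounds):
        # Resample n indices with replacement: some points repeat, some are left out.
        indices = [rng.randrange(n) for _ in range(n)]
        in_bag = set(indices)
        out_of_bag = [i for i in range(n) if i not in in_bag]
        if not out_of_bag:
            continue  # nothing left to evaluate on in this round
        train = [data[i] for i in indices]
        prediction = sum(train) / len(train)  # stand-in model: predict the training mean
        mse = sum((data[i] - prediction) ** 2 for i in out_of_bag) / len(out_of_bag)
        scores.append(mse)
    return sum(scores) / len(scores)

data = [2.1, 3.4, 1.9, 4.2, 2.8, 3.1, 2.5, 3.9]  # made-up sample
estimate = bootstrap_oob_mse(data)
print(estimate)
```

Unlike plain duplication, each resample is a random draw with replacement, so every round trains on a different subset and tests on points the model has not seen.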

Upvotes: 1
