Does it help to duplicate original data in order to make more data for building a model?

Question

I just got an interview question.

"Assume you want to build a statistical or machine learning model, but you have very limited data on hand. Your boss tells you that you can duplicate the original data several times to make more data for building the model." Does it help?

Intuitively, it does not help, because duplicating original data doesn't create more "information" to feed the model.

Can anyone explain this more rigorously, in statistical terms? Thanks.

Robert Dodier · Accepted Answer

Well, it depends on exactly what one means by "duplicating the data".

If one duplicates the whole data set exactly some number of times, then methods based on maximum likelihood (which covers many models in common use) must find exactly the same result: the log likelihood function of the duplicated data is exactly a multiple of the log likelihood of the unduplicated data, and therefore has the same maxima. (This argument doesn't apply to methods that aren't based on the likelihood function; I believe that CART and other tree models, and SVMs, are such methods. For those you'll have to work out a different argument.)
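To make the likelihood argument concrete, here is a minimal sketch, assuming a simple Gaussian model fitted by maximum likelihood with NumPy. The model choice, the helper names `gaussian_mle` and `gaussian_log_likelihood`, the sample size, and the duplication factor `k` are all illustrative assumptions, not part of the original question.

```python
import numpy as np

def gaussian_log_likelihood(x, mu, sigma2):
    """Log likelihood of i.i.d. Gaussian data for a given mean and variance."""
    x = np.asarray(x, dtype=float)
    return float(np.sum(-0.5 * np.log(2 * np.pi * sigma2)
                        - (x - mu) ** 2 / (2 * sigma2)))

def gaussian_mle(x):
    """Maximum likelihood estimates: sample mean and variance with ddof=0."""
    x = np.asarray(x, dtype=float)
    return float(x.mean()), float(x.var())

rng = np.random.default_rng(0)          # fixed seed, illustrative only
data = rng.normal(loc=2.0, scale=1.5, size=20)

k = 5                                   # hypothetical duplication factor
duplicated = np.tile(data, k)           # the whole data set repeated k times

mu, s2 = gaussian_mle(data)
mu_dup, s2_dup = gaussian_mle(duplicated)

# Evaluate both log likelihoods at the same arbitrary parameter values.
ll = gaussian_log_likelihood(data, mu=1.0, sigma2=2.0)
ll_dup = gaussian_log_likelihood(duplicated, mu=1.0, sigma2=2.0)

print(np.isclose(mu, mu_dup), np.isclose(s2, s2_dup))  # same estimates
print(np.isclose(ll_dup, k * ll))                      # log likelihood scaled by k
```

Since the duplicated log likelihood is just `k` times the original at every parameter value, the location of the maximum cannot move.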

However, if by duplicating, one means duplicating the positive examples in a classification problem (which is common enough, since there are often many more negative examples than positive), then that does make a difference, since the likelihood function is modified.
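As a small illustration of why duplicating only one class changes the fit, here is a sketch assuming the simplest possible classifier: an intercept-only Bernoulli model, whose maximum likelihood estimate is just the fraction of positive labels. The toy labels and the number of extra copies are made up for the example.

```python
import numpy as np

# Toy labels: 2 positives, 8 negatives (illustrative data only).
y = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 0])

# Duplicate only the positive examples: add 3 extra copies of each,
# so every positive appears 4 times in total.
positives = y[y == 1]
y_oversampled = np.concatenate([y, np.tile(positives, 3)])

# For an intercept-only Bernoulli model, the MLE of P(y = 1)
# is the sample fraction of positives.
p_hat = y.mean()
p_hat_oversampled = y_oversampled.mean()

print(p_hat)              # 0.2
print(p_hat_oversampled)  # 0.5
```

Here the likelihood is no longer a constant multiple of the original one, so its maximum moves. With a real classifier that has features, the effect is similar in spirit to giving the positive examples a larger weight.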

Also, if one means bootstrapping, then that, too, makes a difference.
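For comparison, here is a rough sketch of bootstrapping, assuming the statistic of interest is the sample mean. Unlike exact duplication, each bootstrap resample is drawn with replacement and so differs from the original data; the spread of the resulting estimates gives an idea of how variable the estimate is. The helper name `bootstrap_means`, the data, and the number of resamples are illustrative assumptions.

```python
import numpy as np

def bootstrap_means(data, n_resamples, rng):
    """Return the mean of each bootstrap resample (same size as data, with replacement)."""
    data = np.asarray(data, dtype=float)
    n = data.size
    # Each row holds the indices of one resample drawn with replacement.
    idx = rng.integers(0, n, size=(n_resamples, n))
    return data[idx].mean(axis=1)

rng = np.random.default_rng(0)          # fixed seed, illustrative only
data = rng.normal(loc=2.0, scale=1.5, size=20)

boot = bootstrap_means(data, n_resamples=1000, rng=rng)

print(data.mean())   # estimate from the original data
print(boot.std())    # bootstrap estimate of its standard error
```

Note that this does not add information either; it only reuses the existing sample to assess the variability of an estimate.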

PS. Probably you'll get more interest in this question on stats.stackexchange.com.
