Reputation: 1957
Let's say we have some input data that we want to use to predict some output. If the set of possible values that a specific input can take has changed over time, is it still appropriate to use all of the data?
Let me try to clarify with an example. Suppose that one of the inputs is a categorical variable that has the unique values [A, B, C] in the data, but we know for a fact that in the current setting in which we will ultimately make predictions, only the values [A, B] are possible.
Would it still be appropriate to use all of the data, or should all of the observations that include a C be excluded?
Upvotes: 1
Views: 49
Reputation: 14062
Consider the case where C does not uniquely map to the target variable but instead shares some target values with A and/or B. In this case, leaving C in the dataset, knowing that it will definitely not occur in future inputs (i.e. the unseen inputs you predict for), will shift the model's hypothesis. How much depends on the model (linear models are more prone to this), and the final hypothesis will consequently be based on redundant information.
In simple terms: the in-sample data does not represent the out-of-sample data, so the model will overfit and won't generalize.
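To make this concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the data is synthetic, the rule that A and B follow y = 2x while C follows y = -x is made up, and the model is a single straight line that ignores the category. It fits that line once on all rows and once on the A/B rows only, then scores both on data that looks like the current setting (no C).

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (closed form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit(rows):
    """Fit the line on (category, x, y) rows, ignoring the category."""
    return fit_line([x for _, x, _ in rows], [y for _, _, y in rows])


def mse(model, rows):
    """Mean squared error of a fitted (intercept, slope) pair on rows."""
    intercept, slope = model
    return sum((intercept + slope * x - y) ** 2 for _, x, y in rows) / len(rows)


def make_rows(rng, n, categories):
    # Assumed synthetic process: A and B follow y = 2x, C follows y = -x.
    rows = []
    for _ in range(n):
        cat = rng.choice(categories)
        x = rng.uniform(0, 10)
        true_slope = -1.0 if cat == "C" else 2.0
        rows.append((cat, x, true_slope * x + rng.gauss(0, 0.5)))
    return rows


rng = random.Random(0)
train = make_rows(rng, 600, ["A", "B", "C"])  # historical data, C still present
deploy = make_rows(rng, 200, ["A", "B"])      # current setting, C cannot occur

model_all = fit(train)                               # keeps the C rows
model_ab = fit([r for r in train if r[0] != "C"])    # drops the C rows

mse_all = mse(model_all, deploy)
mse_ab = mse(model_ab, deploy)
print(f"trained on A, B, C:   MSE on A/B data = {mse_all:.2f}")
print(f"trained on A, B only: MSE on A/B data = {mse_ab:.2f}")
```

In this constructed case the C rows pull the fitted slope away from the A/B relationship, so the model trained on everything does worse on A/B-only data. If the C rows followed the same relationship as A and B, keeping them would not hurt in this way, so it is worth checking on a hold-out set that matches the setting you will actually predict in.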
Upvotes: 1