Having OneHotEncoder handle unseen values at the transform step

Question

I am using sklearn.preprocessing.OneHotEncoder to encode categorical data of the form

from numpy import array
A = array([[1, 4, 1], [0, 3, 2]])
B = array([[1, 4, 7], [0, 3, 2]])

Suppose I fit the encoder with .fit(A) and later pass B as new data to .transform(B). If B contains values unseen with respect to A, the transform raises a "feature out of bounds" error (shown below). Is it possible to have B containing new unseen values such that the transform step sets all binaries to zero for the concerned value?

ValueError: Feature out of bounds. Try setting n_values.

I understand I can change the feature bounds at .fit time. But if I am using A as training data, each time I get a new set B to predict on, I would have to change my initial encoding.

Thanks.

Fred Foo · Accepted Answer

Is it possible to have B containing new unseen values such that the transform step sets all binaries to zero for the concerned value?

No, but it would be nice if OneHotEncoder did that, so I've opened an issue for this. For now, you'll just have to set n_values a bit higher.
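
For readers on a later scikit-learn release: the encoder has since gained a handle_unknown parameter, and setting it to "ignore" gives the behaviour asked for above. The following is a minimal sketch, assuming a recent scikit-learn version in which OneHotEncoder infers the categories of each column from the fitted data. It reuses A and B from the question; the variable names enc and encoded are only illustrative.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

A = np.array([[1, 4, 1], [0, 3, 2]])  # training data
B = np.array([[1, 4, 7], [0, 3, 2]])  # new data; 7 was never seen in column 2

# Assumption: a scikit-learn release that supports handle_unknown="ignore".
# With this setting, an unseen value yields all zeros for that feature's
# indicator columns instead of raising an error at transform time.
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(A)

# transform returns a sparse matrix by default; densify it for inspection.
encoded = enc.transform(B).toarray()
print(encoded)
```

In the first row of the output, the two indicator columns for the third feature are both zero, because 7 does not match either of the fitted categories (1 and 2). The encoding learned from A is left untouched.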
