user17375

Reputation: 529

Getting OneHotEncoder to handle unseen values at the transform step

I am using sklearn.preprocessing.OneHotEncoder to encode categorical data of the form

from numpy import array
A = array([[1, 4, 1], [0, 3, 2]])
B = array([[1, 4, 7], [0, 3, 2]])

Suppose I fit the encoder with .fit(A) and later pass B as new data to .transform(B). If B contains values that were not seen in A, the transform raises a "feature out of bounds" error (here, the 7 in the third column of B). Is it possible to have B containing new unseen values such that the transform step sets all binaries to zero for the concerned value?

ValueError: Feature out of bounds. Try setting n_values.

I understand I can change the feature bounds at .fit time. But if I am using A as training data, I would have to revisit my initial encoding every time I get a new set B to predict.

Thanks.

Upvotes: 2

Views: 2286

Answers (2)

telekineser

Reputation: 88

This feature has since been added to OneHotEncoder. Set the parameter handle_unknown='ignore' and any value not seen during fit is encoded as all zeros for that feature at transform time.

For example:

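import numpy as np
from sklearn.preprocessing import OneHotEncoder

A = np.array([[1, 4, 1], [0, 3, 2]])
B = np.array([[1, 4, 7], [0, 3, 2]])

# Categories are learned from A only; unseen values are ignored at transform time.
onehot = OneHotEncoder(handle_unknown='ignore')
A_encoded = onehot.fit_transform(A)
# The unseen 7 in B's third column is encoded as all zeros for that column.
B_encoded = onehot.transform(B)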

Upvotes: 1

Fred Foo

Reputation: 363547

Is it possible to have B containing new unseen values such that the transform step sets all binaries to zero for the concerned value?

No, but it would be nice if OneHotEncoder did that, so I've opened an issue for this. For now, you'll just have to set n_values a bit higher.
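
For a version of the library without this behaviour, the requested encoding can be approximated by hand. The following is only a minimal pure-Python sketch, not scikit-learn's implementation; the helper names fit_categories and transform_ignore_unknown are hypothetical and chosen for illustration. It learns the values seen in each column of A and emits an all-zero block for any value in B that was not seen.

```python
# Hypothetical helpers for illustration; not part of scikit-learn.

def fit_categories(rows):
    """Collect the sorted set of values seen in each column."""
    return [sorted(set(column)) for column in zip(*rows)]


def transform_ignore_unknown(rows, categories):
    """One-hot encode rows column by column.

    A value that was not seen during fitting matches no category,
    so its block of indicators is all zeros.
    """
    encoded = []
    for row in rows:
        out = []
        for value, cats in zip(row, categories):
            out.extend(1 if value == cat else 0 for cat in cats)
        encoded.append(out)
    return encoded


A = [[1, 4, 1], [0, 3, 2]]
B = [[1, 4, 7], [0, 3, 2]]

categories = fit_categories(A)  # [[0, 1], [3, 4], [1, 2]]
encoded_B = transform_ignore_unknown(B, categories)
print(encoded_B)
# [[0, 1, 0, 1, 0, 0], [1, 0, 1, 0, 0, 1]]
```

The last two entries of the first row are both zero because 7 never appeared in the third column of A, while the encoding learned from A itself is left untouched.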

Upvotes: 3
