Reputation: 95
For example, suppose I trained the model on these values:
Column1 = A , Column2 = B , Column3 = C , Label = 10
Column1 = D , Column2 = E , Column3 = F , Label = 20
Column1 = G , Column2 = H , Column3 = I , Label = 30
What if I then want to predict for this row?
Column1 = A , Column2 = B , Column3 = Z
What does the model do in that case?
Upvotes: 0
Views: 98
Reputation: 8667
It depends on how you process the categorical data. If, for example, you used the dictionary-based one-hot vectorizer:
new CategoricalOneHotVectorizer("Column1", "Column2", "Column3")
then the model will build a dictionary of terms per column:
Column1 -> [A, D, G]
Column2 -> [B, E, H]
Column3 -> [C, F, I]
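If a value has not been seen during training (it is not present in the column's dictionary), the CategoricalOneHotVectorizer assigns zero to all of that column's one-hot slots. So your example A B Z will turn into 1 0 0 1 0 0 0 0 0: A and B each activate the first slot of their column, while all three Column3 slots stay zero.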
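To make the dictionary behaviour concrete, here is a minimal sketch in plain Python. It is not ML.NET code and does not claim to match its internals; the helper names `fit_dictionaries` and `one_hot` are made up for illustration, and the slot order is assumed to follow the order in which values first appear in the training data.

```python
def fit_dictionaries(rows):
    """Build one term list per column, in order of first appearance."""
    dictionaries = [[] for _ in range(len(rows[0]))]
    for row in rows:
        for col, value in enumerate(row):
            if value not in dictionaries[col]:
                dictionaries[col].append(value)
    return dictionaries


def one_hot(row, dictionaries):
    """Encode a row; a value missing from its dictionary leaves all slots at zero."""
    encoded = []
    for value, terms in zip(row, dictionaries):
        slots = [0] * len(terms)
        if value in terms:
            slots[terms.index(value)] = 1
        encoded.extend(slots)
    return encoded


# Training rows from the question (labels omitted, they do not affect the encoding).
train = [("A", "B", "C"), ("D", "E", "F"), ("G", "H", "I")]
dictionaries = fit_dictionaries(train)

# "Z" was never seen in Column3, so its three slots stay zero.
encoded_unseen = one_hot(("A", "B", "Z"), dictionaries)
print(encoded_unseen)  # [1, 0, 0, 1, 0, 0, 0, 0, 0]
```

The practical consequence is that the model sees the row as "A, B, and nothing known for Column3", so the prediction is driven only by the first two columns.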
If, on the other hand, you use hash-based one-hot encoding:
new CategoricalHashOneHotVectorizer("Column1", "Column2", "Column3")
the incoming value Z will be hashed in the same way as the seen values C, F and I, and this will activate one of the 2^HashBits slots of the output column, based on the value of the hash.
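The hash-based variant can be sketched the same way, again in plain Python rather than ML.NET. The assumptions here are mine: `HASH_BITS = 4` is an arbitrary illustrative value, and MD5 is used only as a convenient deterministic stand-in, not as the hash function ML.NET actually uses. The helper names are hypothetical.

```python
import hashlib

HASH_BITS = 4  # illustrative value only; gives 2**4 = 16 slots per column


def hash_slot(value, hash_bits=HASH_BITS):
    """Map a string to a slot index in [0, 2**hash_bits) with a stable hash."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "little") % (2 ** hash_bits)


def hash_one_hot(value, hash_bits=HASH_BITS):
    """Encode a value as a vector with exactly one active slot."""
    slots = [0] * (2 ** hash_bits)
    slots[hash_slot(value, hash_bits)] = 1
    return slots


# An unseen value such as "Z" is encoded exactly like a seen one:
# no dictionary lookup is involved, so one slot is always activated.
encoded_z = hash_one_hot("Z")
print(len(encoded_z), sum(encoded_z))  # 16 1
```

Unlike the dictionary approach, the unseen value never produces an all-zero vector. With few hash bits it may also land in the same slot as a value seen during training, in which case the model cannot tell the two apart.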
The documentation for CategoricalOneHotVectorizer is not very clear on this point, but it does say:
The Key value is the one-based index of the slot set in the Ind/Bag options. If the Key option is not found, it is assigned the value zero.
Upvotes: 1