Denys

Reputation: 4557

When classifying, does one need to normalize new incoming features when predicting on real data?

There are two data sets: the training set and a set of features whose labels are yet to be predicted (the new one).

I built a Random Forest classifier. Along the way I had to do two things:

  1. Normalize continuous numeric features.
  2. Perform a one-hot-encoding on the categorical ones.
Now I have two questions. When I am predicting labels for the new data:

  1. Do I need to normalize the incoming features? (Common sense tells me yes :) ) If so, should I take the mean, max, and min values for a specific feature from the training data set, or should I somehow take the new values of the features into account?

  2. How do I one-hot encode the new values of the features? Do I expand the dictionary of possible categories for a specific feature to take the possibly new values into account?

In my case I possess both data sets, so I could calculate all this stuff in advance, but what if I only had a classifier and a new data set?

Upvotes: 1

Views: 1513

Answers (1)

IVlad

Reputation: 43477

I only have basic knowledge of the type of classifier and normalization techniques you're using, but the general rule, which I think applies to what you're doing as well, is the following.

Your classifier is not a Random Forest Classifier. That is only one step of the pipeline that acts as your actual classifier. This pipeline / actual classifier is what you describe:

  1. Normalize continuous numeric features.
  2. Perform a one-hot-encoding on the categorical ones.
  3. Use a Random Forest Classifier on what you get from the first 2 steps.

This pipeline, which encompasses three things, is what you're actually using as your classifier.
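For concreteness, here is a minimal sketch of such a pipeline. The thread never names a library, so scikit-learn, the class choices, and the toy data below are all assumptions for illustration:

    import numpy as np
    from sklearn.compose import ColumnTransformer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

    # Toy data: columns 0-1 are continuous, column 2 is categorical.
    X_train = np.array([[1.0, 200.0, "Europe"],
                        [5.0, 400.0, "Asia"],
                        [3.0, 300.0, "Africa"]], dtype=object)
    y_train = np.array([0, 1, 0])
    X_new = np.array([[4.0, 250.0, "Asia"]], dtype=object)

    preprocess = ColumnTransformer([
        ("scale", MinMaxScaler(), [0, 1]),                        # step 1
        ("onehot", OneHotEncoder(handle_unknown="ignore"), [2]),  # step 2
    ])

    clf = Pipeline([
        ("preprocess", preprocess),
        ("forest", RandomForestClassifier(random_state=0)),       # step 3
    ])

    clf.fit(X_train, y_train)  # builds all the state: min/max, category dictionary, trees
    print(clf.predict(X_new))  # reuses that state on unseen data

The point of bundling the three steps is that fit and predict then apply to the whole pipeline at once, so you cannot accidentally preprocess the new data with different state than the training data.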

Now, how does a classifier work?

  1. You build some state based on the training data.
  2. You use that state to make predictions on the test data.

So:

  1. Do I need to normalize the incoming features? (common sense tells me that yes :) ) If so, should I take the mean, max, min values for a specific feature from the training data set or should I somehow take into account the new values of the features?

Your classifier normalizes the incoming features for the training data, so it will normalize those for unseen instances too. To do this, it must use the state it has built during training.

For example, if you were doing min-max scaling on your features, your state would store a min(f) and max(f) for each feature f. Then, during testing / prediction, you would do min-max scaling for each feature f using the stored min(f) and max(f) values.
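A sketch of that fit/transform split, again assuming scikit-learn (the variable names are illustrative):

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler

    X_train = np.array([[1.0, 200.0], [5.0, 400.0], [3.0, 300.0]])
    X_new = np.array([[6.0, 500.0]])  # both values exceed the training range

    scaler = MinMaxScaler()
    scaler.fit(X_train)        # state: min(f) and max(f) per feature f
    print(scaler.transform(X_new))  # scales with the *stored training* min/max
    # values outside the training range map outside [0, 1], e.g. 1.25 and 1.5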

I'm not sure what you mean by "normalize continuous numeric features". Do you mean discretization? If you build some state for this discretization during training, then you need to find a way to factor that in.
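If discretization is indeed what's meant, the same principle holds: the bin edges are the state. A sketch, with scikit-learn's KBinsDiscretizer as one possible tool (not necessarily what the asker used):

    import numpy as np
    from sklearn.preprocessing import KBinsDiscretizer

    X_train = np.array([[1.0], [3.0], [5.0], [9.0]])
    X_new = np.array([[4.0], [7.0]])

    disc = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
    disc.fit(X_train)             # state: bin edges computed from training data only
    print(disc.transform(X_new))  # new values fall into the training-time bins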

  2. How do I one-hot encode the new values of the features? Do I expand the dictionary of possible categories for a specific feature to take the possibly new values into account?

Don't you know how many values each category can have beforehand? Usually you do (since categoricals are things like nationality, continent etc. - things you know in advance). If you can get a value for a categorical feature that you haven't seen during training, that raises the question of whether you should even care about it. What good is a categorical value you've never trained on?

Maybe add an "unknown" category. I think expanding by a single one should be fine; what good are more going to do if you've never trained on them? See the sketch below.
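One way to get that behaviour, assuming scikit-learn again: OneHotEncoder with handle_unknown="ignore" encodes any unseen category as an all-zeros row, which acts as a single shared "unknown" bucket:

    from sklearn.preprocessing import OneHotEncoder

    train_cats = [["Europe"], ["Asia"], ["Africa"]]
    new_cats = [["Asia"], ["Antarctica"]]  # "Antarctica" was never seen in training

    enc = OneHotEncoder(handle_unknown="ignore")
    enc.fit(train_cats)                       # state: the category dictionary
    print(enc.transform(new_cats).toarray())
    # [[0. 1. 0.]    <- "Asia" (categories are sorted: Africa, Asia, Europe)
    #  [0. 0. 0.]]   <- unseen category -> all zeros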

What kind of categoricals do you have?

I could be wrong, but do you really need one-hot encoding? AFAIK, tree-based classifiers don't seem to benefit that much from it.

Upvotes: 2
