Reputation: 5102
For input data of different scale I understand that the values used to train the classifier has to be normalized for correct classification(SVM).
So does the input vector for prediction also needs to be normalized?
The scenario that I have is that the training data is normalized and serialized and saved in the database, when a prediction has to be done the serialized data is deserialized to get the normalized numpy array, and the numpy array is then fit on the classifier and the input vector for prediction is applied for prediction. So does this input vector also needs to be normalized? If so how to do it, since at the time of prediction I don't have the actual input training data to normalize?
Also I am normalizing along axis=0 , i.e. along the column.
my code for normalizing is :
preprocessing.normalize(data, norm='l2',axis=0)
is there a way to serialize preprocessing.normalize
Upvotes: 4
Views: 5819
Reputation: 1131
In SVMs it is recommended a scaler for several reasons.
When you put the features in the same scale you must remove the mean and divide by the standard deviation.
xi - mi
xi -> ------------
sigmai
You must storage the mean and standard deviation of every feature in the training set to use the same operations in future data.
In python you have functions to do that for you:
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
To obtain means and standar deviations:
scaler = preprocessing.StandardScaler().fit(X)
To normalize then the training set (X is a matrix where every row is a data and every column a feature):
X = scaler.transform(X)
After the training, you must normalize of future data before the classification:
newData = scaler.transform(newData)
Upvotes: 6