Reputation: 738
When doing regression or classification, what is the correct (or better) way to preprocess the data?
Which of the above is more correct, or is the "standardized" way to preprocess the data? By "normalize" I mean either standardization, linear scaling or some other techniques.
Upvotes: 23
Views: 32750
Reputation: 483
There is another reason, which comes from the PCA objective function; the details are in the linked page. The derivation there assumes that the X matrix has been normalized before PCA.
Upvotes: -2
Reputation: 1
The answer is the 3rd option: after doing PCA we have to normalize the PCA output, because the transformed data will be on a completely different scale. We should normalize the dataset both before and after PCA, as this will be more accurate.
Upvotes: 0
Reputation: 11
Normalize the data first. In fact, some R packages used for PCA normalize the data automatically before performing it. If the variables have different units or describe different characteristics, normalization is mandatory.
Upvotes: 0
Reputation: 47392
You should normalize the data before doing PCA. For example, consider the following situation. I create a data set X with a known correlation matrix C:
>> C = [1 0.5; 0.5 1];
>> A = chol(C);
>> X = randn(100,2) * A;
If I now perform PCA, I correctly find that the principal components (the rows of the weights vector) are oriented at an angle to the coordinate axes:
>> wts=pca(X)
wts =
0.6659 0.7461
-0.7461 0.6659
If I now scale the first feature of the data set by 100, intuitively we think that the principal components shouldn't change:
>> Y = X;
>> Y(:,1) = 100 * Y(:,1);
However, we now find that the principal components are aligned with the coordinate axes:
>> wts=pca(Y)
wts =
1.0000 0.0056
-0.0056 1.0000
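A quick back-of-the-envelope calculation shows why the components snap to the axes. This is only a sketch: it assumes the features have exactly the covariance C defined above (the sample covariance of 100 points will differ a little), and the scaling matrix S = diag(100, 1) is notation introduced here for illustration. The covariance of the rescaled data is

$$
\Sigma_Y = S\,C\,S =
\begin{pmatrix} 100 & 0 \\ 0 & 1 \end{pmatrix}
\begin{pmatrix} 1 & 0.5 \\ 0.5 & 1 \end{pmatrix}
\begin{pmatrix} 100 & 0 \\ 0 & 1 \end{pmatrix}
=
\begin{pmatrix} 10^4 & 50 \\ 50 & 1 \end{pmatrix}.
$$

Writing the leading eigenvector as $v = (v_1, v_2)$ with eigenvalue $\lambda_1 \approx 10^4$, the second row of $(\Sigma_Y - \lambda_1 I)\,v = 0$ gives

$$
50\,v_1 + (1 - \lambda_1)\,v_2 = 0
\quad\Longrightarrow\quad
\frac{v_2}{v_1} = \frac{50}{\lambda_1 - 1} \approx 0.005 .
$$

So the first component is almost exactly the first coordinate axis, which is consistent with the small off-diagonal entries in the output above.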
To resolve this, there are two options. First, I could rescale the data:
>> Ynorm = bsxfun(@rdivide,Y,std(Y))
(The weird bsxfun notation is used to do vector-matrix arithmetic in Matlab; all I'm doing here is dividing each feature by its standard deviation. Note that this line does not subtract the mean.)
We now get sensible results from PCA:
>> wts = pca(Ynorm)
wts =
-0.7125 -0.7016
0.7016 -0.7125
They're slightly different to the PCA on the original data because we've now guaranteed that our features have unit standard deviation, which wasn't the case originally.
The other option is to perform PCA using the correlation matrix of the data, instead of the outer product:
>> wts = pca(Y,'corr')
wts =
0.7071 0.7071
-0.7071 0.7071
In fact this is completely equivalent to standardizing the data by subtracting the mean and then dividing by the standard deviation. It's just more convenient. In my opinion you should always do this unless you have a good reason not to (e.g. if you want to pick up differences in the variation of each feature).
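The equivalence can be sketched in a few lines. The symbols here are introduced for illustration and are not part of the original post: $\mu$ is the vector of feature means, $\Sigma$ the covariance matrix of the features, $D = \operatorname{diag}(\sigma_1, \dots, \sigma_p)$ the diagonal matrix of their standard deviations, and $R$ their correlation matrix. For the standardized variable $z = D^{-1}(x - \mu)$,

$$
\operatorname{Cov}(z) = D^{-1}\,\Sigma\,D^{-1} = R,
\qquad
R_{ij} = \frac{\Sigma_{ij}}{\sigma_i\,\sigma_j},
$$

so PCA on the covariance of the standardized data and PCA on the correlation matrix of the raw data diagonalize the same matrix. With two features this also explains the 0.7071 entries: for any correlation $r \neq 0$,

$$
R = \begin{pmatrix} 1 & r \\ r & 1 \end{pmatrix},
\qquad
R \begin{pmatrix} 1 \\ 1 \end{pmatrix} = (1 + r) \begin{pmatrix} 1 \\ 1 \end{pmatrix},
\qquad
R \begin{pmatrix} 1 \\ -1 \end{pmatrix} = (1 - r) \begin{pmatrix} 1 \\ -1 \end{pmatrix},
$$

and normalizing these eigenvectors to unit length gives entries of $\pm 1/\sqrt{2} \approx \pm 0.7071$, up to an arbitrary sign.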
Upvotes: 28
Reputation: 14701
You always need to normalize the data first. Otherwise, PCA and other dimensionality-reduction techniques will give different results depending on the scale of each feature.
Upvotes: 7