user836026

Reputation: 11340

Dimension Reduction in Matlab using PCA

I have a matrix with 35 columns and I'm trying to reduce the dimension using PCA. I run PCA on my data:

[coeff,score,latent,tsquared,explained,mu] = pca(data);
explained =
     99.9955
      0.0022
      0.0007
      0.0003
      0.0002
      0.0001
      0.0001
      0.0001

Then, by looking at the vector explained, I noticed that the value of the first element is over 99. Based on this, I decided to take only the first component, so I did the following:

k=1;
X = bsxfun(@minus, data, mean(data)) * coeff(:, 1:k); 

Then I used X for SVM training:

svmStruct = fitcsvm(X,Y,'Standardize',true, 'Prior','uniform','KernelFunction','linear','KernelScale','auto','Verbose',0,'IterationLimit', 1000000);

However, when I tried to predict and calculate the misclassification rate:

[label,score,cost] = predict(svmStruct, X);

the result was disappointing. I noticed that when I select only one component (k=1), all the classifications are wrong. However, as I increase the number of included components, k, the results improve, as you can see from the diagram below. But this doesn't make sense according to explained, which indicates that I should be fine with only the first eigenvector.

Did I make a mistake?

This diagram shows the classification error as a function of the number of included eigenvectors: [plot]
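For reference, this is roughly the kind of loop that produces such a curve. It is a sketch rather than the exact code used; it assumes numeric class labels Y and measures the resubstitution error on the same data used for training:

% Sketch: resubstitution error as a function of the number of principal
% components kept. Assumes data, Y and coeff (from pca(data)) as above.
centred = bsxfun(@minus, data, mean(data));
err = zeros(size(coeff, 2), 1);
for k = 1:size(coeff, 2)
    Xk = centred * coeff(:, 1:k);                   % project onto the first k PCs
    mdl = fitcsvm(Xk, Y, 'Standardize', true, 'Prior', 'uniform', ...
        'KernelFunction', 'linear', 'KernelScale', 'auto');
    label = predict(mdl, Xk);                       % predict on the training set
    err(k) = mean(label ~= Y);                      % misclassification rate
end
plot(1:numel(err), err)
xlabel('Number of principal components')
ylabel('Misclassification rate')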

This graph was generated after normalizing the data before PCA, as suggested by @zelanix: [image]

This is also the plotted graph: [image]

and these are the explained values obtained after doing normalization before PCA:

>> [coeff,score,latent,tsquared,explained,mu] = pca(data_normalised);
Warning: Columns of X are linearly dependent to within machine precision.
Using only the first 27 components to compute TSQUARED. 
> In pca>localTSquared (line 501)
  In pca (line 347) 
>> explained

explained =

   32.9344
   15.6790
    5.3093
    4.7919
    4.0905
    3.8655
    3.0015
    2.7216
    2.6300
    2.5098
    2.4275
    2.3078
    2.2077
    2.1726
    2.0892
    2.0425
    2.0273
    1.9135
    1.8809
    1.7055
    0.8856
    0.3390
    0.2204
    0.1061
    0.0989
    0.0334
    0.0085
    0.0000
    0.0000
    0.0000
    0.0000
    0.0000
    0.0000
    0.0000
    0.0000

Upvotes: 0

Views: 927

Answers (2)

zelanix

Reputation: 3562

Parag S. Chandakkar is absolutely right that there is no reason to expect that PCA will automatically improve your classification result. It is an unsupervised method so is not intended to improve separability, only to find the components with the largest variance.

But there are some other problems with your code. In particular, this line confuses me:

X = bsxfun(@minus, data, mean(data)) * coeff(:, 1:k);

You need to normalise your data before performing PCA, and each feature needs to be normalised separately. I use the following:

data_normalised = data;

% Normalise each feature (column) independently: subtract its mean and
% divide by its standard deviation (a per-feature z-score), ignoring NaNs.
for f = 1:size(data, 2)
    data_normalised(:, f) = data_normalised(:, f) - nanmean(data_normalised(:, f));
    data_normalised(:, f) = data_normalised(:, f) / nanstd(data_normalised(:, f));
end

% Run PCA on the normalised data, then project it onto the principal
% components (data_normalised is already zero-mean, so this matches the
% score output of pca).
pca_coeff = pca(data_normalised);

data_pca = data_normalised * pca_coeff;

You can then extract the first principal component as data_pca(:, 1).
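If you prefer to choose the number of components from the explained variance rather than hard-coding it, something along these lines works (a sketch; the 95% threshold is an arbitrary choice, and explained is the extra output of pca as in the question):

% Sketch: keep the smallest k whose cumulative explained variance reaches 95%.
[pca_coeff, ~, ~, ~, explained] = pca(data_normalised);
k = find(cumsum(explained) >= 95, 1);
data_pca_k = data_normalised * pca_coeff(:, 1:k);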

Also, always plot your PCA results to get an idea of what is actually going on:

figure
% Scatter the first two principal components, one colour per class in Y.
scatter(data_pca(Y == 1, 1), data_pca(Y == 1, 2))
hold on;
scatter(data_pca(Y == 2, 1), data_pca(Y == 2, 2))

Upvotes: 2

Autonomous

Reputation: 9075

PCA gives the directions of maximum variance in the data; it does not necessarily lead to better classification. If you want to reduce your data while trying to maximize your accuracy, you should do LDA.

The following picture illustrates exactly what I want to convey.

[image]
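If you want to try this in MATLAB, fitcdiscr trains a linear discriminant classifier directly on the original features. A minimal sketch, assuming data and numeric labels Y as in the question (note this fits LDA as a classifier; it does not by itself return a reduced-dimension projection):

% Sketch: linear discriminant analysis as a supervised alternative to PCA.
lda = fitcdiscr(data, Y);          % fit an LDA classifier on the original features
label = predict(lda, data);        % predict on the training set
resub_err = mean(label ~= Y);      % resubstitution misclassification rate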

Upvotes: 2
