Reputation: 2371
I have a problem: I need to remove highly correlated features. How can I do this?
For example, I have 40 instances with 20 features (randomly created). Features 2 and 18 are highly correlated with feature 4, and feature 6 is highly correlated with feature 10. How do I remove the highly correlated (redundant) features such as 2, 18 and 10? Essentially, I need the indices of the remaining features: 1, 3, 4, 5, 6, ..., 9, 11, ..., 17, 19, 20.
Matlab code:
x = randn(40,20);          % 40 instances, 20 features
x(:,2)  = 2.*x(:,4);       % features 2 and 18 are exact multiples of feature 4
x(:,18) = 3.*x(:,4);
x(:,6)  = 100.*x(:,10);    % feature 6 is an exact multiple of feature 10
x_corr = corr(x);          % 20-by-20 correlation matrix
size(x_corr)
figure, imagesc(x_corr), colorbar
The correlation matrix x_corr looks like this: (imagesc plot of the 20-by-20 correlation matrix)
edit:
I worked out a way:
x_corr = x_corr - diag(diag(x_corr));     % zero out the diagonal (self-correlations)
[x_corrX, x_corrY] = find(x_corr > 0.8);  % row/column indices of highly correlated pairs
for i = 1:size(x_corrX,1)
    xx = find(x_corrY == x_corrX(i));     % zero the mirrored entry of each pair
    x_corrX(xx,:) = 0;
    x_corrY(xx,:) = 0;
end
x_corrX = unique(x_corrX);
x_corrX = x_corrX(2:end);                 % drop the leading 0 left by the zeroing
im = setxor(x_corrX, (1:20)');            % indices of the features to keep
Am I right? If you have a better idea, please post it. Thanks.
edit2: Is this method the same as using PCA?
Upvotes: 3
Views: 5843
Reputation: 74
I think woodchips' answer is quite good. But when you use eigenvalues, you can run into some trouble: if the dataset is large enough, there will always be some small eigenvalues, and you won't be sure what they are telling you.
Instead, consider grouping your data by a simple clustering method. It's easy to implement in Matlab.
http://www.mathworks.de/de/help/stats/cluster-analysis-1-1.html
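For instance, a minimal sketch of that idea (hierarchical clustering of the features on 1 - |correlation|, using the Statistics Toolbox; the 0.2 cutoff is an arbitrary stand-in for "highly correlated"):
d = 1 - abs(corr(x));                 % dissimilarity: 0 = perfectly correlated
d = (d + d')/2;                       % force exact symmetry
d(1:size(d,1)+1:end) = 0;             % zero diagonal so squareform accepts it
Z = linkage(squareform(d), 'average');
grp = cluster(Z, 'cutoff', 0.2, 'criterion', 'distance');
[~, keep] = unique(grp, 'first');     % one representative feature per group
keep = sort(keep)                     % indices of the features to retain
Each cluster then contains one group of mutually redundant features, and you keep a single representative from each.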
edit:
If you disregard the points that woodchips made, your solution is okay as an algorithm.
Upvotes: 2
Reputation:
It seems quite clear that this idea of yours, to simply remove highly correlated variables from the analysis, is NOT the same as PCA. PCA is a good way to do rank reduction, turning what seems to be a complicated problem into one that turns out to have only a few independent things happening. PCA uses an eigenvalue (or SVD) decomposition to achieve that goal.
Anyway, you might have a problem. For example, suppose that A is highly correlated with B, and B is highly correlated with C. However, it need not be true that A and C are highly correlated. Since correlation can be viewed as a measure of the angle between those vectors in their corresponding high-dimensional vector space, this can easily be made to happen.
As a trivial example, I'll create two variables, A and B, that are correlated at a "moderate" level.
n = 50;
A = rand(n,1);
B = A + randn(n,1)/2;
corr([A,B])
ans =
1 0.55443
0.55443 1
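(As a quick check on the "angle" interpretation above: the correlation is just the cosine of the angle between the mean-centered vectors.)
a = A - mean(A);  b = B - mean(B);
dot(a, b) / (norm(a)*norm(b))      % same value as corr(A,B), 0.55443 here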
So here 0.55 is the correlation. I'll create C to be virtually the average of A and B. It will be highly correlated with both of them by your definition.
C = (A + B)/2 + randn(n,1)/100;
corr([A,B,C])
ans =
1 0.55443 0.80119
0.55443 1 0.94168
0.80119 0.94168 1
Clearly C is the bad guy here. But if one were to simply look at the pair [A,C] and remove A from the analysis, then do the same with the pair [B,C] and then remove B, we would have made the wrong choices. And this was a trivially constructed example.
In fact, the eigenvalues of the correlation matrix may be of interest here.
[V,D] = eig(corr([A,B,C]))
V =
-0.53056 -0.78854 -0.311
-0.57245 0.60391 -0.55462
-0.62515 0.11622 0.7718
D =
2.5422 0 0
0 0.45729 0
0 0 0.00046204
The fact that D has two significant diagonal elements and one tiny one tells us that this is really a two-variable problem. What PCA will not easily tell us, though, is which variable to simply remove, and the problem would only become less clear with more variables and many interactions between all of them.
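Applying the same eigenvalue check to the question's 40-by-20 x (a quick sketch, assuming x from the question's code is in the workspace) gives three essentially-zero eigenvalues, one for each exactly redundant column, which confirms that only 17 independent things are going on, but again not which columns to drop:
e = sort(eig(corr(x)));
e(1:4)    % the three smallest are essentially zero; the rest are clearly nonzero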
Upvotes: 2