Reputation: 2371
I have a problem: I need to remove highly correlated features. How can I do this?
For example, I have 40 instances with 20 features (randomly created). Features 2 and 18 are highly correlated with feature 4, and feature 6 is highly correlated with feature 10. How do I remove the highly correlated (redundant) features such as 2, 18 and 10? Essentially, I need the indices of the remaining features: 1, 3, 4, 5, 6, ..., 9, 11, ..., 17, 19, 20.
Matlab code:
x = randn(40,20);          % 40 instances, 20 features
x(:,2)  = 2.*x(:,4);       % features 2 and 18 are exact multiples of feature 4
x(:,18) = 3.*x(:,4);
x(:,6)  = 100.*x(:,10);    % feature 6 is an exact multiple of feature 10
x_corr = corr(x);          % 20-by-20 correlation matrix
size(x_corr)
figure, imagesc(x_corr), colorbar
The correlation matrix x_corr looks like this: (imagesc plot of the 20-by-20 correlation matrix)
edit:
I worked out a way:
x_corr = x_corr - diag(diag(x_corr));     % zero out the diagonal (self-correlations)
[x_corrX, x_corrY] = find(x_corr > 0.8);  % row/column indices of highly correlated pairs
for i = 1:size(x_corrX,1)
    xx = find(x_corrY == x_corrX(i));     % zero the mirrored entry of each pair
    x_corrX(xx,:) = 0;
    x_corrY(xx,:) = 0;
end
x_corrX = unique(x_corrX);
x_corrX = x_corrX(2:end);                 % drop the leading 0 left by the zeroing
im = setxor(x_corrX, (1:20)');            % indices of the features to keep
Am I right? If you have a better idea, please post it. Thanks.
edit2: Is this method the same as using PCA?
Upvotes: 3
Views: 5843
Reputation: 74
I think woodchips' answer is quite good. But when you use eigenvalues, you can run into some trouble: if the dataset is large enough, there will always be some small eigenvalues, and you won't be sure what they are telling you.
Instead, consider grouping your data by a simple clustering method. It's easy to implement in Matlab.
http://www.mathworks.de/de/help/stats/cluster-analysis-1-1.html
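For instance, a minimal sketch of that idea (hierarchical clustering of the features on 1 - |correlation|, using the Statistics Toolbox; the 0.2 cutoff is an arbitrary stand-in for "highly correlated"):
d = 1 - abs(corr(x));                 % dissimilarity: 0 = perfectly correlated
d = (d + d')/2;                       % force exact symmetry
d(1:size(d,1)+1:end) = 0;             % zero diagonal so squareform accepts it
Z = linkage(squareform(d), 'average');
grp = cluster(Z, 'cutoff', 0.2, 'criterion', 'distance');
[~, keep] = unique(grp, 'first');     % one representative feature per group
keep = sort(keep)                     % indices of the features to retain
Each cluster then contains one group of mutually redundant features, and you keep a single representative from each.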
edit:
If you disregard the points that woodchips made, your solution is okay as an algorithm.
Upvotes: 2
Reputation:
It seems quite clear that this idea of yours, to simply remove highly correlated variables from the analysis, is NOT the same as PCA. PCA is a good way to do rank reduction, turning what seems to be a complicated problem into one that turns out to have only a few independent things happening. PCA uses an eigenvalue (or SVD) decomposition to achieve that goal.
Anyway, you might have a problem. For example, suppose that A is highly correlated with B, and B is highly correlated with C. However, it need not be true that A and C are highly correlated. Since correlation can be viewed as a measure of the angle between those vectors in their corresponding high-dimensional vector space, this can easily be made to happen.
As a trivial example, I'll create two variables, A and B, that are correlated at a "moderate" level.
n = 50;
A = rand(n,1);
B = A + randn(n,1)/2;
corr([A,B])
ans =
1 0.55443
0.55443 1
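(As a quick check on the "angle" interpretation above: the correlation is just the cosine of the angle between the mean-centered vectors.)
a = A - mean(A);  b = B - mean(B);
dot(a, b) / (norm(a)*norm(b))      % same value as corr(A,B), 0.55443 here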
So here 0.55 is the correlation. I'll create C to be virtually the average of A and B. It will be highly correlated with both of them by your definition.
C = (A + B)/2 + randn(n,1)/100;
corr([A,B,C])
ans =
1 0.55443 0.80119
0.55443 1 0.94168
0.80119 0.94168 1
Clearly C is the bad guy here. But if one were to simply look at the pair [A,C] and remove A from the analysis, then do the same with the pair [B,C] and then remove B, we would have made the wrong choices. And this was a trivially constructed example.
In fact, the eigenvalues of the correlation matrix may be of interest here.
[V,D] = eig(corr([A,B,C]))
V =
-0.53056 -0.78854 -0.311
-0.57245 0.60391 -0.55462
-0.62515 0.11622 0.7718
D =
2.5422 0 0
0 0.45729 0
0 0 0.00046204
The fact that D has two significant diagonal elements and one tiny one tells us that this is really a two-variable problem. What PCA will not easily tell us, though, is which variable to simply remove, and the problem would only become less clear with more variables and many interactions between all of them.
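Applying the same eigenvalue check to the question's 40-by-20 x (a quick sketch, assuming x from the question's code is in the workspace) gives three essentially-zero eigenvalues, one for each exactly redundant column, which confirms that only 17 independent things are going on, but again not which columns to drop:
e = sort(eig(corr(x)));
e(1:4)    % the three smallest are essentially zero; the rest are clearly nonzero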
Upvotes: 2