Reputation: 131
I have a set of features which I wish to rank according to their correlation coefficients with each other, without accounting for the true label (using the label would be supervised feature selection, right?). My objective is to select the first feature as the one most correlated with all the others, remove it, and so on.
The problem is how to compute the correlation of one vector with a matrix (all the other vectors/features). Is this possible, or am I going about it the wrong way?
PS: I'm using MATLAB 2013b
Thank you all
Upvotes: 1
Views: 1885
Reputation: 124563
Say you have an n-by-d matrix X where the rows are instances and the columns are the features/dimensions. Then you can compute the correlation coefficient matrix simply using the corr or corrcoef functions:
% Fisher Iris dataset, 150x4
>> load fisheriris
>> X = meas;
>> C = corr(X)
C =
1.0000 -0.1176 0.8718 0.8179
-0.1176 1.0000 -0.4284 -0.3661
0.8718 -0.4284 1.0000 0.9629
0.8179 -0.3661 0.9629 1.0000
The result is a d-by-d matrix containing the correlation coefficient of each feature against every other feature. The diagonal is thus all ones (because corr(x,x) = 1), and the matrix is symmetric (because corr(x,y) = corr(y,x)). Values range from -1 to 1, where -1 means inverse correlation between two variables, 1 means positive correlation, and 0 means no linear correlation.
Now, since you want to remove the feature that is, on average, the most correlated with the other features, you have to summarize that matrix as one number per feature. One way to do that is to take the mean of each column:
% mean
>> mean_corr = mean(C)
mean_corr =
0.6430 0.0220 0.6015 0.6037
% most correlated feature on average
>> [~,idx] = max(mean_corr)
idx =
1
% drop that feature
>> X(:,idx) = [];
I probably should have taken the mean of the absolute value of C in the above code, because we don't care whether two variables are positively or negatively correlated, only how strong the correlation is.
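A minimal sketch of that correction, combined with the iterative select-and-remove procedure the question describes (variable names are my own; zeroing the diagonal so that the trivial corr(x,x) = 1 does not inflate the averages is also my addition):

```matlab
% Rank features by average absolute correlation with the others,
% removing the most correlated one each round.
load fisheriris
X = meas;                       % 150x4 feature matrix
d = size(X,2);
remaining = 1:d;                % original column indices still in play
order = zeros(1,d);             % removal order of the original features
for k = 1:d
    C = abs(corr(X));           % absolute pairwise correlations
    C(1:size(C,1)+1:end) = 0;   % zero the diagonal (self-correlation)
    [~,j] = max(mean(C,1));     % feature most correlated with the rest, on average
    order(k) = remaining(j);    % record it in terms of the original columns
    remaining(j) = [];
    X(:,j) = [];                % drop it and repeat on what remains
end
```

After the loop, order lists the original feature indices from most to least redundant.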
Upvotes: 2