Reputation: 333
The original data is Y; the size of Y is L*n (n is the number of features; L is the number of observations). B is the covariance matrix of the original data Y. Suppose A holds the eigenvectors of the covariance matrix B. I represent A as A = (e1, e2, ..., en), where ei is an eigenvector. The matrix Aq consists of the first q eigenvectors, and ai are the row vectors of Aq: Aq = (e1, e2, ..., eq) = (a1, a2, ..., an)'. I want to apply the k-means algorithm to Aq to cluster the row vectors ai into k clusters or more (note: I do not want to apply the k-means algorithm to cluster the eigenvectors ei into k clusters). For each cluster, only the vector closest to the center of the cluster is retained, and the feature corresponding to this vector is finally selected as an informative feature.
My question is:
1) What is the difference between applying the k-means algorithm to Aq to cluster the row vectors ai into k clusters and applying it to Aq to cluster the eigenvectors ei into k clusters?
2) The closest_vectors I get come from this command: closest_vectors = Aq(min_idxs, :); the size of closest_vectors is k*q (double). How do I get the final informative features? They have to be obtained from the original data Y.
Thanks!
I found two functions for PCA and PFA:
function [e, m, lambda, sqsigma] = cvPca(X, M)
% PCA via SVD: X is D x N (features x samples); returns the first M eigenvectors e (D x M),
% the mean m, the eigenvalues lambda, and the residual variance sqsigma
[D, N] = size(X);
if ~exist('M', 'var') || isempty(M) || M == 0
M = D;
end
M = min(M,min(D,N-1));
%% mean subtraction
m = mean(X, 2); %%% calculate the mean of every row
X = X - repmat(m, 1, N);
%% singular value decomposition. X = U*S*V.' or X.' = V*S*U.'
[U, S, V] = svd(X, 'econ');
e = U(:,1:M);
if nargout > 2
s = diag(S);
s = s(1:min(D,N-1));
lambda = s.^2 / N; % biased (1/N) estimator of variance
end
% sqsigma. Used to model distribution of errors by univariate Gaussian
if nargout > 3
d = cvPcaDist(X, e, m); % Use of validation set would be better
N = size(d,2);
sqsigma = sum(d) / N; % or (N-1) unbiased est
end
end
%/////////////////////////////////////////////////////////////////////////////
function [IDX, Me] = cvPfa(X, p, q)
[D, N] = size(X);
if ~exist('p', 'var') || isempty(p) || p == 0
p = D;
end
p = min(p, min(D, N-1));
if ~exist('q', 'var') || isempty(q)
q = p - 1;
end
%% PCA step
[U, Me, Lambda] = cvPca(X, q);
%% cluster the row vectors of U (one row per feature, each of length q), not the columns
[Cl, Mu] = kmeans(U, p, 'emptyaction', 'singleton', 'distance', 'sqEuclidean');
%% find the axis (feature) nearest to each cluster's mean vector
IDX = false(D,1);
for i = 1:p
Cli = find(Cl == i);
d = cvEucdist(Mu(i,:).', U(Cli,:).'); % cvEucdist: pairwise Euclidean distances (helper from the same toolbox)
[mini, argmin] = min(d);
IDX(Cli(argmin)) = true; % keep only the feature whose row is closest to the cluster centroid
end
end
Upvotes: 0
Views: 121
Reputation: 104483
Summarizing Olologin's comments, it doesn't make sense to cluster the eigenvectors of the covariance matrix, or the columns of the U matrix of the SVD. The eigenvectors in this case are all orthogonal, so if you tried to cluster them you would only get one member per cluster, and that cluster's centroid would be defined by the eigenvector itself.
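As a quick sanity check (a minimal sketch on a random covariance matrix; pdist and squareform come from the Statistics Toolbox, same as kmeans), the pairwise Euclidean distances between the unit-norm eigenvectors all come out as sqrt(2), so k-means has nothing meaningful to separate:
n = 6;
B = cov(randn(100, n)); %// a random n x n covariance matrix
[V, ~] = eig(B); %// columns of V are orthonormal eigenvectors
D = squareform(pdist(V.')); %// pairwise distances between the eigenvectors
disp(D); %// every off-diagonal entry is sqrt(2) ~ 1.4142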
Now, what you're really after is selecting out the features in your data matrix that describe your data in terms of discriminatory analysis.
The functions that you have provided both compute the SVD, pluck out the k principal components of your data, and also determine which of your original features to select as the most prominent. By default, the number of features to select is equal to k, but you can override this if you want. Let's just stick with the default.
The cvPfa function performs this feature selection for you, but be warned that the data matrix in the function is organized so that each row is a feature and each column is a sample. The output is a logical vector that tells you which features in your data are the strongest ones to select.
Simply put, you just do this:
k = 10; %// Example
IDX = cvPfa(Y.', k); %// Y is L x n, so Y.' is n x L (features x samples)
Ynew = Y(:,IDX); %// keep only the selected feature columns of Y
This code will choose the 10 most prominent features in your data matrix, i.e. the 10 features that are the most representative, or the most discriminative, of your data. You can then use the output for whatever application you're targeting.
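If you also want the actual column indices of the selected features (so you can see which of the original features survived), the logical mask converts directly to indices; for example (featIdx is just a placeholder name):
featIdx = find(IDX); %// indices of the selected columns of Y
Ynew = Y(:, featIdx); %// same result as Y(:,IDX)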
Upvotes: 1
Reputation: 9390
1) I don't think that clustering the eigenvectors (the columns of the PCA result) of the covariance matrix makes any sense. All eigenvectors are pairwise orthogonal and equally far from one another in terms of Euclidean distance: pick any two unit-norm eigenvectors and compute the distance between them, and it will be sqrt(2). Clustering the rows of the PCA result, on the other hand, can provide something useful.
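If it helps, here is a minimal sketch of that row-clustering idea, including how the retained rows map back to columns of the original Y (q, k and the variable names are placeholders of mine; bsxfun keeps it compatible with older MATLAB releases):
% Y is L x n (observations x features); q and k are chosen beforehand
B = cov(Y); % n x n covariance matrix
[V, D] = eig(B); % eigenvectors in the columns of V
[~, order] = sort(diag(D), 'descend');
Aq = V(:, order(1:q)); % first q eigenvectors: Aq = (e1, ..., eq) with rows a1, ..., an
[labels, C] = kmeans(Aq, k); % cluster the n row vectors of Aq into k clusters
selected = zeros(k, 1);
for j = 1:k
members = find(labels == j);
d2 = sum(bsxfun(@minus, Aq(members, :), C(j, :)).^2, 2); % squared distances to centroid j
[~, pos] = min(d2);
selected(j) = members(pos); % keep the feature whose row is closest to the centroid
end
Yinformative = Y(:, selected); % the informative features taken from the original data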
Upvotes: 1