Reputation: 112
I'm running K-means clustering on a 12-dimensional matrix and have obtained K clusters. I want to show the result in a 2D plot, but I can't figure out how to convert the 12-dimensional data into 2 dimensions.
Any suggestions on how to do the conversion, or on alternative ways to visualize the result? I tried Multidimensional Scaling for Java (MDSJ), but it did not work.
The K-means implementation I'm using is from the Java Machine Learning Library: Clustering basics.
Upvotes: 0
Views: 2078
Reputation: 4213
I would use Principal Component Analysis (PCA), probably the simplest of the methods related to multidimensional scaling. (By the way, PCA has nothing to do with K-means; it is a general method for dimensionality reduction.)
I assume variables are in columns, observations are in rows.
Standardize the data: convert the variables to z-scores. That is, from each cell, subtract the mean of its column and divide the result by the standard deviation of that column. That way every column has zero mean and unit variance. The former is obligatory; the latter is, I would say, good to do. If the columns have unit variance, you can calculate the eigenvectors from the covariance matrix; otherwise you have to use the correlation matrix, which in effect standardizes the data automatically. (See this for an explanation.)
Calculate the eigenvectors and eigenvalues of the covariance matrix. Sort the eigenvectors by their eigenvalues, largest first. (Many libraries already return them sorted, but check whether the order is ascending or descending.)
Take the first two columns of the sorted eigenvector matrix and multiply the z-score matrix by them. The result has one row per observation and two columns, which you can plot.
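Written out as formulas, the steps above look roughly like this. The symbols are my own notation, not from any particular library: X is the n-by-p data matrix (p = 12 here), and I assume the sample standard deviation (dividing by n - 1) is used for the z-scores, in which case C is the correlation matrix of X.

```latex
\begin{aligned}
z_{ij} &= \frac{x_{ij} - \bar{x}_j}{s_j}
  && \text{(z-score of observation } i \text{, variable } j\text{)} \\
C &= \frac{1}{n-1}\, Z^{\top} Z
  && \text{(covariance matrix of the z-scores, } p \times p\text{)} \\
C\, v_k &= \lambda_k v_k, \qquad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p
  && \text{(eigenvectors sorted by eigenvalue)} \\
Y &= Z \begin{bmatrix} v_1 & v_2 \end{bmatrix}
  && \text{(} n \times 2 \text{ matrix of points to plot)}
\end{aligned}
```

The share of the total variance kept by the 2D picture is (λ_1 + λ_2) / (λ_1 + ... + λ_p), so it is worth checking that number before trusting the plot.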
Using the Colt library, you can do the following. It will be similar with other matrix libraries:
import cern.colt.matrix.DoubleMatrix1D;
import cern.colt.matrix.DoubleMatrix2D;
import cern.colt.matrix.doublealgo.Statistic;
import cern.colt.matrix.impl.DenseDoubleMatrix2D;
import cern.colt.matrix.linalg.Algebra;
import cern.colt.matrix.linalg.EigenvalueDecomposition;
import hep.aida.bin.DynamicBin1D;

public class Pca {

    // Columns whose standard deviation is below this are treated as constant
    private static final double EPSILON = 1e-12;

    // Sample data only, to show how the matrix is created; PCA on random data is not very meaningful
    public static void main(String[] x) {
        double[][] data = {
            {2.0, 4.0, 1.0, 4.0, 4.0, 1.0, 5.0, 5.0, 5.0, 2.0, 1.0, 4.0},
            {2.0, 6.0, 3.0, 1.0, 1.0, 2.0, 6.0, 4.0, 4.0, 4.0, 1.0, 5.0},
            {3.0, 4.0, 4.0, 4.0, 2.0, 3.0, 5.0, 6.0, 3.0, 1.0, 1.0, 1.0},
            {3.0, 6.0, 3.0, 3.0, 1.0, 2.0, 4.0, 6.0, 1.0, 2.0, 4.0, 4.0},
            {1.0, 6.0, 4.0, 2.0, 2.0, 2.0, 3.0, 4.0, 6.0, 3.0, 4.0, 1.0},
            {2.0, 5.0, 5.0, 3.0, 1.0, 1.0, 6.0, 6.0, 3.0, 2.0, 6.0, 1.0}
        };
        DoubleMatrix2D matrix = new DenseDoubleMatrix2D(data);
        DoubleMatrix2D pm = pcaTransform(matrix);
        // Print the first two columns of the transformed matrix;
        // they capture more of the variance than any other pair of components
        System.out.println(pm.viewPart(0, 0, pm.rows(), 2).toString());
    }

    /** Returns the data in the space of principal components; take the first n columns. */
    public static DoubleMatrix2D pcaTransform(DoubleMatrix2D matrix) {
        DoubleMatrix2D zScoresMatrix = toZScores(matrix);
        final DoubleMatrix2D covarianceMatrix = Statistic.covariance(zScoresMatrix);
        // Compute eigenvalues and eigenvectors of the covariance matrix
        final EigenvalueDecomposition decomp = new EigenvalueDecomposition(covarianceMatrix);
        // Colt returns the eigenvalues in ascending order, hence the column flip.
        // Columns of Vs are eigenvectors = principal components = basis of the new space,
        // ordered by decreasing variance
        final DoubleMatrix2D Vs = decomp.getV().viewColumnFlip();
        // Eigenvalues: ev(i) / sum(ev) is the share of variance captured by the i-th column of Vs
        // final DoubleMatrix1D ev = decomp.getRealEigenvalues().viewFlip();
        // Project the standardized matrix onto the PCA space
        return Algebra.DEFAULT.mult(zScoresMatrix, Vs);
    }

    /**
     * Converts a matrix to a matrix of z-scores (column by column).
     */
    public static DoubleMatrix2D toZScores(final DoubleMatrix2D matrix) {
        // The z-score matrix is dense, so a dense implementation fits better than a sparse one
        final DoubleMatrix2D zMatrix = new DenseDoubleMatrix2D(matrix.rows(), matrix.columns());
        for (int c = 0; c < matrix.columns(); c++) {
            final DoubleMatrix1D column = matrix.viewColumn(c);
            final DynamicBin1D bin = Statistic.bin(column);
            final double mean = bin.mean();
            final double sd = bin.standardDeviation();
            for (int r = 0; r < matrix.rows(); r++) {
                // A (near-)constant column carries no information; set it to zero
                // instead of dividing by zero
                double zScore = sd < EPSILON ? 0.0 : (column.get(r) - mean) / sd;
                zMatrix.set(r, c, zScore);
            }
        }
        return zMatrix;
    }
}
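The commented-out eigenvalue line in the code above hints at a useful check: how much of the total variance the first two components actually capture. Here is a minimal sketch in plain Java, with no Colt dependency. The class name `ExplainedVariance`, the method `ratios`, and the sample eigenvalues are made up for illustration; in practice you would pass in the eigenvalues from your decomposition, sorted in descending order.

```java
import java.util.Arrays;

public class ExplainedVariance {

    /**
     * Returns each eigenvalue divided by the sum of all eigenvalues.
     * Assumes the eigenvalues are non-negative and not all zero.
     */
    public static double[] ratios(double[] eigenvalues) {
        double total = Arrays.stream(eigenvalues).sum();
        double[] out = new double[eigenvalues.length];
        for (int i = 0; i < eigenvalues.length; i++) {
            out[i] = eigenvalues[i] / total;
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical eigenvalues, already sorted in descending order
        double[] ev = {6.0, 3.0, 2.0, 1.0};
        double[] r = ratios(ev);
        // Share of variance kept by the 2D plot: (6 + 3) / 12
        System.out.println("First two components capture " + (r[0] + r[1]));
    }
}
```

If the first two ratios add up to only a small fraction, the 2D plot may hide much of the structure that K-means saw in the full 12 dimensions.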
You could also use Weka. I would first load your data into Weka and run PCA from the GUI (under attribute selection). You will see which classes are called with which parameters, and you can then do the same thing from your code. The catch is that you will need to convert or wrap your matrix into the data format Weka works with.
Upvotes: 1
Reputation: 5543
In addition to what the other answers suggest, you should probably have a look at multidimensional scaling too.
Upvotes: 0
Reputation: 3577
A similar question has been discussed on Cross Validated. The basic idea is to find an appropriate projection that separates these clusters (e.g., with discrproj in R) and then to plot the projection of the clusters in the new space.
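Whichever projection you choose, the plotting step needs the 2D coordinates together with the cluster each point belongs to. One possible way to hand that to a plotting tool is a small CSV export, sketched below in plain Java. The class `ClusterCsv`, the method `toCsv`, and the sample values are hypothetical; the cluster ids are assumed to come from your K-means run, in the same row order as the projected points.

```java
public class ClusterCsv {

    /** Builds CSV text with one "x,y,cluster" row per projected point. */
    public static String toCsv(double[][] points2d, int[] clusterIds) {
        if (points2d.length != clusterIds.length) {
            throw new IllegalArgumentException("Need exactly one cluster id per point");
        }
        StringBuilder sb = new StringBuilder("x,y,cluster\n");
        for (int i = 0; i < points2d.length; i++) {
            sb.append(points2d[i][0]).append(',')
              .append(points2d[i][1]).append(',')
              .append(clusterIds[i]).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical projected points and their K-means cluster assignments
        double[][] projected = {{1.5, -0.25}, {-2.0, 0.5}, {0.75, 1.0}};
        int[] clusters = {0, 1, 0};
        System.out.print(toCsv(projected, clusters));
    }
}
```

The resulting file can be loaded into a spreadsheet, R, or gnuplot and drawn as a scatter plot colored by the cluster column.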
Upvotes: 0