k-means with sparse vectors in ELKI

Question

When I try this method with dense vector data, it runs correctly, but with sparse vector data it throws java.lang.ArrayIndexOutOfBoundsException. Which data source can I use to read sparse vector data correctly?

public void runKmeans(double[][] data) {
  ArrayAdapterDatabaseConnection dataArray = new ArrayAdapterDatabaseConnection(data);

  ListParameterization params = new ListParameterization();
  params.addParameter(AbstractDatabase.Parameterizer.DATABASE_CONNECTION_ID, dataArray);

  Database db = ClassGenericsUtil.parameterizeOrAbort(StaticArrayDatabase.class, params);
  db.initialize();

  // Parameterization
  params = new ListParameterization();
  params.addParameter(KMeans.K_ID, k);
  params.addParameter(KMeans.SEED_ID, 0);

  // Set up the algorithm
  KMeansOutlierDetection kmeansAlg = ClassGenericsUtil.parameterizeOrAbort(KMeansOutlierDetection.class, params);
  //testParameterizationOk(params);

  // Run k-means on the database
  OutlierResult result = kmeansAlg.run(db);
  ...

Erich Schubert · Accepted Answer

The class ArrayAdapterDatabaseConnection can only be used for dense vectors. You must supply a rectangular double[][] array, in which every row has the same length.
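A ragged array (rows of different lengths) is one plausible way to run into an ArrayIndexOutOfBoundsException here. As a minimal sketch, you could validate the array before handing it to ArrayAdapterDatabaseConnection. The class RectangularCheck and its method are hypothetical helpers written for illustration; they are not part of ELKI.

```java
// Hypothetical helper, NOT part of ELKI: verifies that all rows have equal length.
final class RectangularCheck {
  /** Returns the common row length, or throws if the rows differ in length. */
  static int requireRectangular(double[][] data) {
    if (data.length == 0) {
      throw new IllegalArgumentException("empty data array");
    }
    int dim = data[0].length;
    for (int i = 1; i < data.length; i++) {
      if (data[i].length != dim) {
        throw new IllegalArgumentException(
            "row " + i + " has length " + data[i].length + ", expected " + dim);
      }
    }
    return dim;
  }

  public static void main(String[] args) {
    double[][] dense = { { 1, 2, 3 }, { 4, 5, 6 } };
    System.out.println("dimensionality: " + requireRectangular(dense));

    double[][] ragged = { { 1, 2, 3 }, { 4 } };
    try {
      requireRectangular(ragged);
    } catch (IllegalArgumentException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```

Failing early with a descriptive message is easier to debug than an index error raised deep inside the algorithm.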

You can use FileBasedDatabaseConnection with the ArffParser to read sparse data. Alternatively, you can implement your own DatabaseConnection; it has only a single method, loadData().
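If you go the ARFF route, the data has to be written out in sparse ARFF syntax first. The sketch below uses plain Java only and assumes the standard Weka sparse notation, where each row is written as {index value, index value} with 0-based attribute indexes; that ELKI's ArffParser accepts exactly this output is an assumption you should verify against your ELKI version. The class SparseArffSketch and the attribute names a0, a1, ... are made up for illustration.

```java
// Hypothetical helper, NOT part of ELKI: renders sparse rows in Weka-style sparse ARFF.
final class SparseArffSketch {
  /** Formats one sparse row as "{index value, index value, ...}". */
  static String toSparseRow(int[] indexes, double[] values) {
    StringBuilder sb = new StringBuilder("{");
    for (int i = 0; i < indexes.length; i++) {
      if (i > 0) {
        sb.append(", ");
      }
      sb.append(indexes[i]).append(' ').append(values[i]);
    }
    return sb.append('}').toString();
  }

  /** Builds a complete ARFF document with dim numeric attributes named a0..a(dim-1). */
  static String toArff(String relation, int dim, int[][] indexes, double[][] values) {
    StringBuilder sb = new StringBuilder();
    sb.append("@relation ").append(relation).append('\n');
    for (int d = 0; d < dim; d++) {
      sb.append("@attribute a").append(d).append(" numeric\n");
    }
    sb.append("@data\n");
    for (int i = 0; i < indexes.length; i++) {
      sb.append(toSparseRow(indexes[i], values[i])).append('\n');
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    int[][] indexes = { { 0, 3 }, { 1 } };
    double[][] values = { { 1.5, 2.0 }, { 4.0 } };
    System.out.print(toArff("sparse_example", 4, indexes, values));
  }
}
```

Declaring all attributes up front also fixes the dimensionality of the data set in the file header.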

DoubleVector is a dense data type, whereas SparseDoubleVector is a sparse vector type. DoubleVector is backed by a dense double[] array; SparseDoubleVector uses an int[] holding the nonzero dimensions plus a double[] holding only the nonzero values.
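To make the two layouts concrete, here is a small plain-Java sketch of the idea. It only illustrates the storage scheme described above; SparseVecSketch is a hypothetical class and does not reproduce the actual SparseDoubleVector implementation or API.

```java
// Hypothetical illustration, NOT the ELKI SparseDoubleVector class.
final class SparseVecSketch {
  final int[] indexes;   // sorted dimensions that hold nonzero values
  final double[] values; // the nonzero values, parallel to indexes
  final int dim;         // overall dimensionality of the vector

  SparseVecSketch(int[] indexes, double[] values, int dim) {
    this.indexes = indexes;
    this.values = values;
    this.dim = dim;
  }

  /** Converts a dense array into the sparse index/value layout. */
  static SparseVecSketch fromDense(double[] dense) {
    int nonzero = 0;
    for (double v : dense) {
      if (v != 0.0) {
        nonzero++;
      }
    }
    int[] idx = new int[nonzero];
    double[] val = new double[nonzero];
    int j = 0;
    for (int d = 0; d < dense.length; d++) {
      if (dense[d] != 0.0) {
        idx[j] = d;
        val[j] = dense[d];
        j++;
      }
    }
    return new SparseVecSketch(idx, val, dense.length);
  }

  /** Looks up a dimension; dimensions that are not stored are zero. */
  double get(int d) {
    int pos = java.util.Arrays.binarySearch(indexes, d);
    return pos >= 0 ? values[pos] : 0.0;
  }

  public static void main(String[] args) {
    SparseVecSketch v = fromDense(new double[] { 0, 1.5, 0, 0, 2 });
    System.out.println(java.util.Arrays.toString(v.indexes));
    System.out.println(java.util.Arrays.toString(v.values));
  }
}
```

The dense layout costs memory proportional to the dimensionality, while the sparse layout costs memory proportional to the number of nonzero entries, at the price of a binary search per lookup.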

K-means requires a fixed dimensionality to allocate the mean vectors (these will always be dense), so make sure to supply a VectorFieldTypeInformation with the maximum dimensionality. There is a type conversion filter that simply scans your data set once and sets the dimensionality accordingly.
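The reason for the fixed dimensionality can be sketched in plain Java: one pass over the sparse data finds the largest dimension in use, and only then can a dense mean be allocated and filled. SparseMeanSketch below is a hypothetical illustration of that idea; it does not use ELKI's filter or VectorFieldTypeInformation, and the real implementation may differ.

```java
// Hypothetical illustration, NOT ELKI code: one scan for the dimensionality,
// then a dense mean computed from sparse index/value rows.
final class SparseMeanSketch {
  /** Scans all rows once and returns the largest index used, plus one. */
  static int maxDimensionality(int[][] indexes) {
    int dim = 0;
    for (int[] row : indexes) {
      for (int d : row) {
        dim = Math.max(dim, d + 1);
      }
    }
    return dim;
  }

  /** Averages sparse rows into a dense mean of length dim (requires at least one row). */
  static double[] denseMean(int[][] indexes, double[][] values, int dim) {
    double[] mean = new double[dim];
    for (int i = 0; i < indexes.length; i++) {
      for (int j = 0; j < indexes[i].length; j++) {
        mean[indexes[i][j]] += values[i][j];
      }
    }
    for (int d = 0; d < dim; d++) {
      mean[d] /= indexes.length;
    }
    return mean;
  }

  public static void main(String[] args) {
    int[][] indexes = { { 0, 3 }, { 1 } };
    double[][] values = { { 2.0, 4.0 }, { 6.0 } };
    int dim = maxDimensionality(indexes);
    System.out.println(dim);
    System.out.println(java.util.Arrays.toString(denseMean(indexes, values, dim)));
  }
}
```

Even when every input vector is sparse, the mean of a cluster tends to have a nonzero entry in most dimensions that any member uses, which is why the means are kept dense.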
