Nan Zhou

Reputation: 1285

How to cluster a very big sparse data set using low memory in Python?

I have data that forms a sparse matrix of shape 1000 x 1e9. I want to cluster the 1000 examples into 10 clusters using K-means.

The matrix is very sparse: fewer than 1 in 1e6 of its values are non-zero.

My laptop has 16 GB of RAM. I tried using a sparse matrix from SciPy. Unfortunately, the clustering process still needs much more memory than I have. Could anyone suggest a way to do this?

My system crashed when running the following test snippet:

import numpy as np
from scipy.sparse import csr_matrix
from sklearn.cluster import KMeans

# A tiny 9 x 1e9 sparse matrix with 18 non-zero entries.
row = np.array([0, 0, 1, 2, 2, 2, 3, 3, 4, 5, 5, 5, 6, 6, 7, 8, 8, 8])
col = np.array([0, 2, 2, 0, 1, 2] * 3)
data = np.array([1, 2, 3, 4, 5, 6] * 3)
X = csr_matrix((data, (row, col)), shape=(9, int(1e9)))  # shape must be integers, not 1e9 (a float)

resC = KMeans(n_clusters=3).fit(X)
resC.labels_

Any helpful suggestion is appreciated.

Upvotes: 2

Views: 3511

Answers (4)

Has QUIT--Anony-Mousse

Reputation: 77454

The KMeans centers will no longer be sparse, so the algorithm would need careful optimization for the sparse case (which may be costly for the usual dense case, so it probably isn't optimized this way).

You can try ELKI (not Python but Java), which is often much faster and also has sparse data types. Using single-precision floats will also help.

But in the end, the results will be questionable: k-means is statistically rooted in least squares. It assumes your data comes from k signals plus some Gaussian error. Because your data is sparse, it obviously does not have this kind of Gaussian shape. When the majority of values are 0, it cannot be Gaussian.

With just 1000 data points, I'd rather use HAC (hierarchical agglomerative clustering).
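A minimal sketch of that route, assuming X is the CSR matrix from the question and that Euclidean distance suits the data (both are assumptions, not part of this answer):

from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
from sklearn.metrics import pairwise_distances

# The 1000 x 1000 pairwise-distance matrix is tiny no matter how many
# columns X has; pairwise_distances accepts sparse input.
D = pairwise_distances(X, metric="euclidean")

# Average linkage avoids re-introducing the least-squares assumption
# criticized above; squareform condenses the symmetric matrix.
Z = linkage(squareform(D, checks=False), method="average")
labels = fcluster(Z, t=10, criterion="maxclust")  # cut into at most 10 clusters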

Upvotes: 3

myrtlecat

Reputation: 2276

Although KMeans accepts sparse matrices as input, the centroids used within the algorithm have a dense representation, and your feature space is so big that even 10 centroids will not fit into 16 GB of RAM (10 x 1e9 double-precision values is about 80 GB).

I have 2 ideas:

  1. Can you fit the clustering into RAM if you discard all empty columns? If you have 1000 samples and only about 1 in 1e6 values are occupied, then probably fewer than 1 in 1000 columns will contain any non-zero entries. (A sketch of this idea follows the list.)
  2. Several clustering algorithms in scikit-learn will accept a matrix of distances between samples instead of the full data, e.g. sklearn.cluster.SpectralClustering. You could precompute the pairwise distances in a 1000x1000 matrix and pass that to your clustering algorithm instead. (I can't make a specific recommendation of a clustering method, or a suitable function to calculate the distances, as it will depend on your application.)
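A minimal sketch of idea 1, assuming X is the CSR matrix from the question (the pruning step itself is an illustration, not part of the original answer):

import numpy as np
from sklearn.cluster import KMeans

# Keep only the columns that contain at least one non-zero entry.
nonzero_cols = np.unique(X.nonzero()[1])
X_reduced = X[:, nonzero_cols]

# With ~1000 samples at ~1/1e6 density, at most a few thousand columns
# are occupied, so the dense centroids now fit comfortably in RAM.
labels = KMeans(n_clusters=10).fit_predict(X_reduced)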

Upvotes: 1

sascha

Reputation: 33532

Whatever you do (with your data, given your memory constraints): k-means is not ready for that!

This includes:

  • Online KMeans / MiniBatch KMeans, as proposed in another answer
    • it only helps with handling many samples (and is hurt by the same effect mentioned later)!
  • Various KMeans implementations in different languages (it's an algorithmic problem, not something bound to one implementation)

Ignoring potential theoretic reasons (high dimensionality and non-convex heuristic optimization), I'm just mentioning the practical problem here:

  • your centroids might become non-sparse! (mentioned in a side note by SO's clustering expert; that link also mentions alternatives!)
    • this means: the sparse data structures used will become very non-sparse and eventually blow up your memory!
    • (I changed sklearn's code to observe what the above link already mentioned)
      • relevant sklearn code: center_shift_total = squared_norm(centers_old - centers)

Even if you remove / turn off all the memory-heavy components, like:

  • init=some_sparse_ndarray (instead of k-means++)

  • n_init=1 instead of 10

  • precompute_distances=False instead of True (unclear if it helps)

  • n_jobs=1 instead of -1

the problem described above will still be yours to deal with! (A sketch of such a stripped-down call follows.)
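A minimal sketch of that stripped-down call, assuming a scikit-learn version in which precompute_distances and n_jobs are still KMeans parameters (both were removed in later releases); initial_centers is a hypothetical (10, n_features) array you would have to supply yourself:

from sklearn.cluster import KMeans

km = KMeans(
    n_clusters=10,
    init=initial_centers,        # hypothetical explicit centers (skips k-means++)
    n_init=1,                    # a single run instead of 10
    precompute_distances=False,  # unclear if it helps
    n_jobs=1,                    # no parallel copies of the data
)
# km.fit(X)  # still expected to exhaust memory: the centroids densify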

Upvotes: 1

Pedro

Reputation: 1121

Consider using a dict, since it will only store the values which were assigned. I guess a nice way to do this is by creating a SparseMatrix object like this:

class SparseMatrix(dict):
    def __init__(self, mapping=()):
        dict.__init__(self, {i: value for i, value in enumerate(mapping)})

    # Overriding this method makes never-assigned indexes return 0.0.
    def __getitem__(self, i):
        try:
            return dict.__getitem__(self, i)
        except KeyError:
            return 0.0

>>> my_matrix = SparseMatrix([1,2,3])
>>> my_matrix[0]
1
>>> my_matrix[5]
0.0

Edit:

For the multi-dimensional case you may need to override the two item-management methods as follows:

def __setitem__(self, ij, value):
    i, j = ij
    dict.__setitem__(self, i*self.n + j, value)

def __getitem__(self, ij):
    try:
        i, j = ij
        return dict.__getitem__(self, i*self.n + j)
    except KeyError:
        return 0.0

>>> my_matrix[0,0] = 10
>>> my_matrix[1,2]
0.0
>>> my_matrix[0,0]
10

This also assumes you have defined self.n as the row length of the matrix (i.e. the number of columns), e.g. in __init__.
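For completeness, a minimal sketch putting the pieces together (the class name and constructor signature here are assumptions, not part of the original answer):

class SparseMatrix2D(dict):
    def __init__(self, n_rows, n_cols):
        dict.__init__(self)
        self.shape = (n_rows, n_cols)
        self.n = n_cols  # row length used to flatten (i, j) into one key

    def __setitem__(self, ij, value):
        i, j = ij
        dict.__setitem__(self, i * self.n + j, value)

    def __getitem__(self, ij):
        try:
            i, j = ij
            return dict.__getitem__(self, i * self.n + j)
        except KeyError:
            return 0.0

>>> my_matrix = SparseMatrix2D(1000, 10**9)
>>> my_matrix[0, 0] = 10
>>> my_matrix[1, 2]
0.0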

Upvotes: -3
