Reputation: 1285
I have data that forms a sparse matrix of shape 1000 x 1e9. I want to cluster the 1000 examples into 10 clusters using K-means.
The matrix is very sparse: fewer than 1 in 1e6 values are nonzero.
My laptop has 16 GB of RAM. I tried a sparse matrix in scipy. Unfortunately, the clustering process needs much more memory than I have. Could anyone suggest a way to do this?
My system crashed when running the following test snippet:
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.cluster import KMeans
row = np.array([0, 0, 1, 2, 2, 2, 3, 3, 4, 5, 5, 5, 6, 6, 7, 8, 8, 8])
col = np.array([0, 2, 2, 0, 1, 2] * 3)
data = np.array([1, 2, 3, 4, 5, 6] * 3)
X = csr_matrix((data, (row, col)), shape=(9, int(1e9)))  # shape entries must be integers
resC = KMeans(n_clusters=3).fit(X)
resC.labels_
Any helpful suggestion is appreciated.
Upvotes: 2
Views: 3511
Reputation: 77454
KMeans centers will not be sparse anymore, so this would need careful optimization for the sparse case (that may be costly for the usual case, so it probably isn't optimized this way).
You can try ELKI (not Python but Java), which is often much faster and also has sparse data types. Using single-precision floats will also help.
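As a rough illustration of the single-precision suggestion, the sketch below casts a scipy CSR matrix to float32. The toy data mirrors the question's snippet, and the names X64 and X32 are illustrative only. Note that only the stored values shrink; the index arrays keep their size.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy data shaped like the question's snippet (9 rows, 1e9 nominal columns).
row = np.array([0, 0, 1, 2, 2, 2, 3, 3, 4, 5, 5, 5, 6, 6, 7, 8, 8, 8])
col = np.array([0, 2, 2, 0, 1, 2] * 3)
data = np.array([1, 2, 3, 4, 5, 6] * 3, dtype=np.float64)

X64 = csr_matrix((data, (row, col)), shape=(9, 10**9))
X32 = X64.astype(np.float32)  # same sparsity pattern, 4-byte values

# Bytes used by the stored values before and after the cast.
print(X64.data.nbytes, X32.data.nbytes)
```

This halves the storage for the nonzero values, but on its own it does not remove the cost of dense centroids.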
But in the end, the results will be questionable: k-means is statistically rooted in least-squares. It assumes your data is coming from k signals plus some Gaussian error. Because your data is sparse, it obviously does not have this kind of Gaussian shape. When the majority of values is 0, it cannot be a Gaussian.
With just 1000 data points, I'd rather use HAC.
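A minimal sketch of what HAC (hierarchical agglomerative clustering) could look like here, assuming Euclidean distance and average linkage, neither of which is prescribed above. It works from a small pairwise distance matrix instead of dense centroids. The column relabelling step is an added assumption of this sketch: columns that are zero everywhere do not change pairwise distances, so only the columns that actually occur are kept.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.metrics import pairwise_distances

# Toy data with the same layout as the question's snippet.
row = np.array([0, 0, 1, 2, 2, 2, 3, 3, 4, 5, 5, 5, 6, 6, 7, 8, 8, 8])
col = np.array([0, 2, 2, 0, 1, 2] * 3)
data = np.array([1, 2, 3, 4, 5, 6] * 3, dtype=np.float64)

# Relabel the columns that actually occur as 0..m-1, so the huge nominal
# width of the matrix is never materialised.
used_cols, compact_col = np.unique(col, return_inverse=True)
X = csr_matrix((data, (row, compact_col)), shape=(9, used_cols.size))

# n x n distance matrix (1000 x 1000 for the real data), cleaned up so it
# is exactly symmetric with a zero diagonal.
D = pairwise_distances(X, metric="euclidean")
D = (D + D.T) / 2.0
np.fill_diagonal(D, 0.0)

# Average-linkage HAC on the condensed distances, cut into at most 3 clusters.
Z = linkage(squareform(D, checks=False), method="average")
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)
```

For the real data, t would be 10 instead of 3, and the choice of metric and linkage method depends on the application.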
Upvotes: 3
Reputation: 2276
Although KMeans accepts sparse matrices as input, the centroids used within the algorithm have a dense representation, and your feature space is so big that even 10 centroids will not fit into 16 GB of RAM.
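The arithmetic behind that statement can be checked quickly. The helper name dense_centroid_bytes is hypothetical and only multiplies the three sizes together:

```python
def dense_centroid_bytes(n_clusters, n_features, itemsize):
    """Bytes needed for a dense (n_clusters x n_features) centroid array."""
    return n_clusters * n_features * itemsize

n_clusters, n_features = 10, 10**9
ram_bytes = 16 * 2**30  # the 16 GB laptop from the question

for name, itemsize in (("float64", 8), ("float32", 4)):
    needed = dense_centroid_bytes(n_clusters, n_features, itemsize)
    print(name, needed / 1e9, "GB; fits in RAM:", needed <= ram_bytes)
```

A single dense centroid array needs about 80 GB in double precision and about 40 GB in single precision, so neither fits.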
I have 2 ideas:
sklearn.cluster.SpectralClustering.
You could precompute the pairwise distances in a 1000x1000 matrix and pass that to your clustering algorithm instead. (I can't make a specific recommendation of a clustering method, or a suitable function to calculate the distances, as it will depend on your application.)
Upvotes: 1
Reputation: 33532
Whatever you do (for your data; given your memory-constraints): kmeans is not ready for that!
This includes:
Ignoring potential theoretical reasons (high dimensionality and non-convex heuristic optimization), I'm just mentioning the practical problem here:
center_shift_total = squared_norm(centers_old - centers)
Even if you remove or turn off all the memory-heavy components, for example:
init=some_sparse_ndarray (instead of k-means++)
n_init=1 instead of 10
precompute_distances=False instead of True (unclear if it helps)
n_jobs=1 instead of -1
the line above will still be the problem you have to deal with!
Upvotes: 1
Reputation: 1121
Consider using dict, since it will only store the values which were assigned. I guess a nice way to do this is by creating a SparseMatrix object like this:
class SparseMatrix(dict):
    def __init__(self, mapping=()):
        # store each initial value under its positional index
        dict.__init__(self, {i: mapping[i] for i in range(len(mapping))})

    # overriding this method makes never-assigned indexes return 0.0
    def __getitem__(self, i):
        try:
            return dict.__getitem__(self, i)
        except KeyError:
            return 0.0
>>> my_matrix = SparseMatrix([1,2,3])
>>> my_matrix[0]
1
>>> my_matrix[5]
0.0
Edit:
For the multi-dimensional case you may need to override the two item-management methods as follows:
    def __setitem__(self, ij, value):
        # map the (row, column) pair to a single flat key
        i, j = ij
        dict.__setitem__(self, i * self.n + j, value)

    def __getitem__(self, ij):
        try:
            i, j = ij
            return dict.__getitem__(self, i * self.n + j)
        except KeyError:
            return 0.0
>>> my_matrix[0,0] = 10
>>> my_matrix[1,2]
0.0
>>> my_matrix[0,0]
10
This also assumes you defined self.n as the length of the matrix rows.
Upvotes: -3