user667222

Reputation: 179

python memory error for kmeans in scikit-learn

I am running scikit-learn's "Selecting the number of clusters" example in Python. The example takes samples with 2 features and finds the best k for k-means clustering.

In my case the samples have 3 features: they are 3-dimensional coordinates. In the code I only changed the input to my samples; the rest is unchanged. I have a very large number of sample points, probably more than 10,000.

When I input all my data I get a memory error (I have 16 GB of RAM and all of it fills up), but with half of the data the error does not occur. Although IPython Notebook reports the error in the silhouette function, I am fairly sure it happens in k-means: it does not seem to perform the clustering and jumps straight to this error.

With the same amount of data I ran k-means clustering in C++ and it was fast, with no problems. Any idea how I can resolve this? This is the error I got:

    MemoryError                               Traceback (most recent call last)
    <ipython-input-4-ed4b060ccea1> in <module>()
         41     # This gives a perspective into the density and separation of the formed
         42     # clusters
    ---> 43     silhouette_avg = silhouette_score(X, cluster_labels)
         44     print("For n_clusters =", n_clusters,
         45           "The average silhouette_score is :", silhouette_avg)

    /usr/lib64/python2.7/site-packages/sklearn/metrics/cluster/unsupervised.pyc in silhouette_score(X, labels, metric, sample_size, random_state, **kwds)
         82         else:
         83             X, labels = X[indices], labels[indices]
    ---> 84     return np.mean(silhouette_samples(X, labels, metric=metric, **kwds))
         85
         86

    /usr/lib64/python2.7/site-packages/sklearn/metrics/cluster/unsupervised.pyc in silhouette_samples(X, labels, metric, **kwds)
        141
        142     """
    --> 143     distances = pairwise_distances(X, metric=metric, **kwds)
        144     n = labels.shape[0]
        145     A = np.array([_intra_cluster_distance(distances[i], labels, i)

    /usr/lib64/python2.7/site-packages/sklearn/metrics/pairwise.pyc in pairwise_distances(X, Y, metric, n_jobs, **kwds)
        649         func = pairwise_distance_functions[metric]
        650         if n_jobs == 1:
    --> 651             return func(X, Y, **kwds)
        652         else:
        653             return _parallel_pairwise(X, Y, func, n_jobs, **kwds)

    /usr/lib64/python2.7/site-packages/sklearn/metrics/pairwise.pyc in euclidean_distances(X, Y, Y_norm_squared, squared)
        181         distances.flat[::distances.shape[0] + 1] = 0.0
        182
    --> 183     return distances if squared else np.sqrt(distances)
        184
        185

    MemoryError:

Upvotes: 1

Views: 3104

Answers (1)

Has QUIT--Anony-Mousse

Reputation: 77454

It's not k-means that runs out of memory.

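But the silhouette evaluation index needs a quadratic number of distance computations, and sklearn apparently does this by computing a full distance matrix, as the pairwise_distances call in your traceback shows. Most likely it even needs multiple copies of it: the line that fails, np.sqrt(distances), allocates a second array of the same size.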

Now, do the math yourself: n samples need n × n distances, at 8 bytes each when stored as doubles. Most implementations run out of memory at around 64k instances when trying to compute a full distance matrix.
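To make that arithmetic concrete, here is a rough sketch of the estimate. The helper name distance_matrix_bytes is made up for illustration, and it assumes a dense matrix of 8-byte floats with no other overhead, so treat the numbers as a lower bound per copy.

```python
def distance_matrix_bytes(n_samples, bytes_per_value=8):
    """Bytes needed for ONE dense n x n distance matrix (hypothetical helper).

    Assumes float64 entries (8 bytes each) and ignores any extra copies
    or temporaries the library may allocate on top of this.
    """
    return n_samples * n_samples * bytes_per_value


# Print the size of a single copy for a few sample counts.
for n in (10000, 64000, 100000):
    gib = distance_matrix_bytes(n) / float(2 ** 30)
    print("%7d samples -> %8.2f GiB per copy" % (n, gib))
```

Every additional copy of the matrix adds the same amount again, so the real peak can be a multiple of what this prints.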

Thus, remove the call to silhouette.
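If you would rather keep an approximate silhouette value than drop it entirely, one option is to score only a random subsample. The signature in your traceback shows that silhouette_score accepts sample_size and random_state, so something like the following sketch may work. The random data, the cluster count and the sample size of 500 are placeholders I made up, not recommendations.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Placeholder data standing in for the 3-D coordinates (assumption).
rng = np.random.RandomState(0)
X = rng.rand(2000, 3)

# Cluster the full data set; k-means itself is not the memory problem.
cluster_labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Score only a random subsample, so the distance matrix is
# sample_size x sample_size instead of n x n.
silhouette_avg = silhouette_score(
    X, cluster_labels, sample_size=500, random_state=0
)
print("approximate silhouette score:", silhouette_avg)
```

The result is an estimate that varies with the subsample, so it is worth repeating with a few different random_state values before trusting a comparison between values of k.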

Upvotes: 2
