Reputation: 179
I am running the "Selecting the number of clusters" example from scikit-learn in Python. The example takes samples with 2 features and finds the best k for k-means clustering.
In my case the samples have 3 features; they are 3-dimensional coordinates. In the code I only changed the input to my samples, and the rest stays the same. The number of sample points is very large, possibly more than 10,000.
When I input all my data I get a memory error (I have 16 GB of RAM and all of it fills up). When I input half of my data, there is no error. Although the IPython notebook reports the error in the silhouette function, I am pretty sure it happens in k-means: it seems to skip the clustering and jump straight to this error.
With the same amount of data I ran k-means clustering in C++, and it was fast and worked without any problem.
Is there any way to resolve this problem?
This is the error I got:
MemoryError Traceback (most recent call last)
<ipython-input-4-ed4b060ccea1> in <module>()
41 # This gives a perspective into the density and separation of the formed
42 # clusters
---> 43 silhouette_avg = silhouette_score(X, cluster_labels)
44 print("For n_clusters =", n_clusters,
45 "The average silhouette_score is :", silhouette_avg)
/usr/lib64/python2.7/site-packages/sklearn/metrics/cluster/unsupervised.pyc in silhouette_score(X, labels, metric, sample_size, random_state, **kwds)
82 else:
83 X, labels = X[indices], labels[indices]
---> 84 return np.mean(silhouette_samples(X, labels, metric=metric, **kwds))
85
86
/usr/lib64/python2.7/site-packages/sklearn/metrics/cluster/unsupervised.pyc in silhouette_samples(X, labels, metric, **kwds)
141
142 """
--> 143 distances = pairwise_distances(X, metric=metric, **kwds)
144 n = labels.shape[0]
145 A = np.array([_intra_cluster_distance(distances[i], labels, i)
/usr/lib64/python2.7/site-packages/sklearn/metrics/pairwise.pyc in pairwise_distances(X, Y, metric, n_jobs, **kwds)
649 func = pairwise_distance_functions[metric]
650 if n_jobs == 1:
--> 651 return func(X, Y, **kwds)
652 else:
653 return _parallel_pairwise(X, Y, func, n_jobs, **kwds)
/usr/lib64/python2.7/site-packages/sklearn/metrics/pairwise.pyc in euclidean_distances(X, Y, Y_norm_squared, squared)
181 distances.flat[::distances.shape[0] + 1] = 0.0
182
--> 183 return distances if squared else np.sqrt(distances)
184
185
MemoryError:
Upvotes: 1
Views: 3104
Reputation: 77454
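It is not k-means that runs out of memory.
The Silhouette evaluation index needs a quadratic number of distance computations, and as the traceback shows, sklearn does this by computing a full distance matrix (`pairwise_distances`). Most likely it even needs multiple copies of it: the final `np.sqrt(distances)`, where the error is raised, allocates a second matrix of the same size.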
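Now, do the math yourself: a full n x n distance matrix of 8-byte doubles takes 8 * n^2 bytes. Most implementations run out of memory at around 64k instances when trying to compute a full distance matrix, since a single copy is then already roughly 32 GB.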
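To make that arithmetic concrete, here is a small back-of-the-envelope sketch. The helper name `distance_matrix_bytes` and the sample counts are purely illustrative, and it assumes 8-byte floats and counts only one copy of the matrix, so real peak usage would be higher.

```python
def distance_matrix_bytes(n_samples, bytes_per_value=8):
    """Memory needed for one dense n x n distance matrix (float64 by default)."""
    return n_samples * n_samples * bytes_per_value


# Illustrative sample counts only; plug in your own number of points.
for n in (10000, 45000, 65536):
    gib = distance_matrix_bytes(n) / 2.0 ** 30
    print("%6d samples -> %6.2f GiB per copy" % (n, gib))
```

Multiply the result by the number of temporary copies the implementation makes to get a rough idea of when 16 GB stops being enough.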
Thus, remove the call to silhouette.
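If you still want a rough Silhouette value, one option worth trying is the `sample_size` parameter that appears in the `silhouette_score` signature in your traceback: the score is then computed on a random subset, so the distance matrix is only `sample_size` x `sample_size`. A minimal sketch, assuming a reasonably recent scikit-learn; the random `X`, the sample size of 2000 and the range of k are placeholders, not recommendations:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Placeholder data standing in for the real 3-D coordinates.
rng = np.random.RandomState(0)
X = rng.rand(20000, 3)

scores = {}
for n_clusters in range(2, 5):
    # k-means itself only needs memory linear in the number of samples.
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=0).fit_predict(X)
    # Evaluate the Silhouette on a random subset of 2000 points, so the
    # pairwise distance matrix is 2000 x 2000 instead of 20000 x 20000.
    scores[n_clusters] = silhouette_score(X, labels, sample_size=2000,
                                          random_state=0)
    print("n_clusters =", n_clusters, "sampled silhouette =", scores[n_clusters])
```

The sampled value is only an estimate of the full score, so it can vary with `random_state`.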
Upvotes: 2