Reputation: 61
I have a fairly large data set (a 1,841,000 × 32 matrix) that I want to run a hierarchical clustering algorithm on. Both the AgglomerativeClustering class and the FeatureAgglomeration class in sklearn.cluster give the error below.
---------------------------------------------------------------------------
MemoryError Traceback (most recent call last)
<ipython-input-10-85ab7b694cf1> in <module>()
1
2
----> 3 mat_red = manifold.SpectralEmbedding(n_components=2).fit_transform(mat)
4 clustering.fit(mat_red,y = None)
~/anaconda3/lib/python3.6/site-packages/sklearn/manifold/spectral_embedding_.py in fit_transform(self, X, y)
525 X_new : array-like, shape (n_samples, n_components)
526 """
--> 527 self.fit(X)
528 return self.embedding_
~/anaconda3/lib/python3.6/site-packages/sklearn/manifold/spectral_embedding_.py in fit(self, X, y)
498 "name or a callable. Got: %s") % self.affinity)
499
--> 500 affinity_matrix = self._get_affinity_matrix(X)
501 self.embedding_ = spectral_embedding(affinity_matrix,
502 n_components=self.n_components,
~/anaconda3/lib/python3.6/site-packages/sklearn/manifold/spectral_embedding_.py in _get_affinity_matrix(self, X, Y)
450 self.affinity_matrix_ = kneighbors_graph(X, self.n_neighbors_,
451 include_self=True,
--> 452 n_jobs=self.n_jobs)
453 # currently only symmetric affinity_matrix supported
454 self.affinity_matrix_ = 0.5 * (self.affinity_matrix_ +
~/anaconda3/lib/python3.6/site-packages/sklearn/neighbors/graph.py in kneighbors_graph(X, n_neighbors, mode, metric, p, metric_params, include_self, n_jobs)
101
102 query = _query_include_self(X, include_self)
--> 103 return X.kneighbors_graph(X=query, n_neighbors=n_neighbors, mode=mode)
104
105
~/anaconda3/lib/python3.6/site-packages/sklearn/neighbors/base.py in kneighbors_graph(self, X, n_neighbors, mode)
482 # construct CSR matrix representation of the k-NN graph
483 if mode == 'connectivity':
--> 484 A_data = np.ones(n_samples1 * n_neighbors)
485 A_ind = self.kneighbors(X, n_neighbors, return_distance=False)
486
~/anaconda3/lib/python3.6/site-packages/numpy/core/numeric.py in ones(shape, dtype, order)
186
187 """
--> 188 a = empty(shape, dtype, order)
189 multiarray.copyto(a, 1, casting='unsafe')
190 return a
MemoryError:
I have 8 GB of RAM, and the same error occurred when I ran it on a 64 GB system. I realize hierarchical clustering is computationally expensive and not recommended for large datasets, but I need to create a dendrogram of all my data at once. I am creating a vocabulary tree from a bag of visual words using ORB features. If there is any other way to achieve this, or a way to fix the error, please let me know. Thank you.
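For a rough sense of scale, the allocation that fails in the traceback (np.ones(n_samples1 * n_neighbors)) can be estimated by hand. The sketch below assumes float64 values and that SpectralEmbedding was left at its documented default of n_neighbors = max(n_samples / 10, 1). The second figure is the size of a condensed pairwise-distance matrix, which is an assumption about what an unstructured hierarchical clustering would need, not something shown in the traceback.

```python
n_samples, n_features = 1_841_000, 32
bytes_per_float = 8  # assumption: float64 arrays

# Assumption: SpectralEmbedding(n_neighbors=None) falls back to n_samples / 10.
n_neighbors = max(int(n_samples / 10), 1)

# Size of the raw data matrix itself.
data_bytes = n_samples * n_features * bytes_per_float

# Size of the array requested by np.ones(n_samples * n_neighbors).
knn_graph_bytes = n_samples * n_neighbors * bytes_per_float

# Size of a condensed matrix holding every pairwise distance once.
condensed_bytes = n_samples * (n_samples - 1) // 2 * bytes_per_float

for name, size in [("data", data_bytes),
                   ("k-NN graph values", knn_graph_bytes),
                   ("condensed distances", condensed_bytes)]:
    print(f"{name}: {size / 1e9:,.1f} GB")
```

Under these assumptions the data itself is under half a gigabyte, while the k-NN graph values alone come to roughly 2.7 TB and a full set of pairwise distances to roughly 13.6 TB, so neither the 8 GB nor the 64 GB machine would be expected to cope.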
Upvotes: 5
Views: 4355
Reputation: 340
I ran into a similar issue running agglomerative clustering. My solution was to run the clustering algorithm on a small subset of the data, selected with train_test_split, and then use KNN to extend the labels from AC to the rest of the data. It works reasonably well, though I'm not sure whether your data is amenable to that treatment. My code for extending is:
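from sklearn.cluster import AgglomerativeClustering
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hold out most of the data; only the training subset is clustered.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=test_size, random_state=42)

# Cluster the small subset with Ward linkage.
AC = AgglomerativeClustering(n_clusters=n_clusters, linkage='ward')
AC.fit(X_train)
labels = AC.labels_

# Train KNN on the subset's cluster labels, then label every row of X.
KN = KNeighborsClassifier(n_neighbors=n_neighbors)
KN.fit(X_train, labels)
labels2 = KN.predict(X)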
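For anyone who wants to try the idea without the original data, here is a minimal self-contained sketch of the same subset-then-KNN approach. The synthetic data from make_blobs, the 10% subset size, and the values n_clusters=4 and n_neighbors=5 are all assumptions chosen for illustration, not values from my actual run.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the real descriptors (assumption: 32 features, 4 groups).
X, _ = make_blobs(n_samples=5000, n_features=32, centers=4, random_state=0)

# Keep 10% of the rows for clustering; the other 90% is only labeled afterwards.
X_sub, X_rest = train_test_split(X, test_size=0.9, random_state=42)

# Agglomerative clustering on the small subset, which fits in memory.
ac = AgglomerativeClustering(n_clusters=4, linkage="ward")
sub_labels = ac.fit_predict(X_sub)

# KNN learns the subset's cluster labels and assigns one to every row of X.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_sub, sub_labels)
all_labels = knn.predict(X)

print(all_labels.shape, np.unique(all_labels))
```

Note that this only propagates flat cluster labels. The hierarchy itself is built from the subset, so it is an approximation of a dendrogram over the full data set rather than the real thing.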
Upvotes: 5