Reputation: 1
I'm trying to fit hierarchical clustering on a 23-dimensional dataset of 100,000 objects. How do I solve the following error?
>>> ac = AgglomerativeClustering(n_clusters=2, affinity='euclidean', linkage='complete')
>>> k = hf.features_itter(hf.file)
>>> k
array([[49,  0,  3, ...,  0,  0,  3],
       [39,  1,  4, ...,  0,  0,  3],
       [25,  0,  3, ...,  0,  0,  1],
       ...,
       [21,  0,  6, ...,  0,  0,  1],
       [47,  0,  8, ...,  0,  0,  2],
       [28,  1,  2, ...,  0,  1,  3]], dtype=uint8)
>>> res = ac.fit_predict(k)
Traceback (most recent call last):
  File "<pyshell#18>", line 1, in <module>
    hierarchical()
  File "C:\Users\Tolis\Downloads\WPy-3670\notebooks\ergasia\clustering.py", line 39, in hierarchical
    ac.fit_predict(k)
  File "C:\Users\Tolis\Downloads\WPy-3670\python-3.6.7\lib\site-packages\sklearn\base.py", line 355, in fit_predict
    self.fit(X)
  File "C:\Users\Tolis\Downloads\WPy-3670\python-3.6.7\lib\site-packages\sklearn\cluster\hierarchical.py", line 830, in fit
    **kwargs)
  File "C:\Users\Tolis\Downloads\WPy-3670\python-3.6.7\lib\site-packages\sklearn\externals\joblib\memory.py", line 329, in __call__
    return self.func(*args, **kwargs)
  File "C:\Users\Tolis\Downloads\WPy-3670\python-3.6.7\lib\site-packages\sklearn\cluster\hierarchical.py", line 584, in _complete_linkage
    return linkage_tree(*args, **kwargs)
  File "C:\Users\Tolis\Downloads\WPy-3670\python-3.6.7\lib\site-packages\sklearn\cluster\hierarchical.py", line 470, in linkage_tree
    out = hierarchy.linkage(X, method=linkage, metric=affinity)
  File "C:\Users\Tolis\Downloads\WPy-3670\python-3.6.7\lib\site-packages\scipy\cluster\hierarchy.py", line 708, in linkage
    y = distance.pdist(y, metric)
  File "C:\Users\Tolis\Downloads\WPy-3670\python-3.6.7\lib\site-packages\scipy\spatial\distance.py", line 1877, in pdist
    dm = np.empty((m * (m - 1)) // 2, dtype=np.double)
ValueError: Maximum allowed dimension exceeded
Upvotes: 0
Views: 1625
Reputation: 1
Thanks for the answer! I had to use hierarchical clustering because that was the case under study, so I followed the solution described at link
Upvotes: 0
Reputation: 3790
I guess there's no elegant solution to this issue with agglomerative clustering, because of the properties of the algorithm itself: it measures the distances between all pairs of objects. The function
y = distance.pdist(y, metric)
is invoked inside AgglomerativeClustering, so the AgglomerativeClustering algorithm does not fit big or even medium-sized datasets well:
The standard algorithm for hierarchical agglomerative clustering (HAC) has a time complexity of O(n^3) and requires O(n^2) memory, which makes it too slow for even medium data sets.
It's slow, and it also needs O(n^2) memory. Even if the algorithm used RAM optimally, the condensed matrix of pairwise distances alone takes ~40 GB: as the last line of the traceback shows, pdist allocates n * (n - 1) / 2 ≈ 5e9 values of dtype np.double (float64), at 8 bytes each, for 100,000 objects. There probably just isn't that much memory available.
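A quick back-of-the-envelope check of that figure (both the n * (n - 1) / 2 size and the float64 dtype come straight from the pdist line in the traceback):

n = 100_000                          # number of objects
n_pairs = n * (n - 1) // 2           # ~5e9 pairwise distances
bytes_needed = n_pairs * 8           # np.double (float64) is 8 bytes
print(f"{bytes_needed / 1024**3:.1f} GiB")  # -> 37.3 GiB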
(I've tested pairwise distances on 100,000 random points with ~100 GB of RAM: it takes a very long time to compute, although it didn't fail.)
It will also run for a very long time, because of its O(n^3) time complexity.
I suggest you try sklearn.cluster.DBSCAN instead. It behaves similarly on some data (sklearn examples), runs much faster, and consumes far less memory:
DBSCAN - Density-Based Spatial Clustering of Applications with Noise. Finds core samples of high density and expands clusters from them. Good for data which contains clusters of similar density.
Memory consumption:
This implementation bulk-computes all neighborhood queries, which increases the memory complexity to O(n.d) where d is the average number of neighbors, while original DBSCAN had memory complexity O(n). It may attract a higher memory complexity when querying these nearest neighborhoods, depending on the algorithm
Time complexity: on average O(n log n), though it depends on the implementation; the worst case is O(n^2), which is still far better than the O(n^3) of agglomerative clustering.
Check out this clustering algorithm; it will probably give nice results. The main caveat is that DBSCAN determines the number of clusters automatically, so you can't set it to 2.
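For example, here is a minimal sketch of how DBSCAN could be applied to the array k from the question; the eps and min_samples values are illustrative guesses, not tuned parameters:

import numpy as np
from sklearn.cluster import DBSCAN

# Stand-in for the real data: 100,000 objects with 23 uint8 features,
# shaped like k = hf.features_itter(hf.file) in the question.
k = np.random.randint(0, 50, size=(100_000, 23), dtype=np.uint8)

# Hypothetical starting parameters; tune eps/min_samples for the real data.
db = DBSCAN(eps=3.0, min_samples=10)
labels = db.fit_predict(k.astype(np.float32))

# Label -1 marks noise; the number of clusters is discovered, not preset.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters)

If you need exactly two groups, you'd have to tune eps and min_samples (or post-process the labels), since DBSCAN has no n_clusters parameter.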
Upvotes: 2