Reputation: 23
I'm using the validity index in the hdbscan package, which implements DBCV score according to the following paper: https://www.dbs.ifi.lmu.de/~zimek/publications/SDM2014/DBCV.pdf
I'm working on a face clustering project, and calling the validity index raises an error.
Here is the code:
dbcv_score_output = hdbscan.validity.validity_index(feature_vectors, archive_labels)
dbcv_score_output
The full error:
hdbscan/validity.py:30: RuntimeWarning: overflow encountered in power
distance_matrix[distance_matrix != 0] = (1.0 / distance_matrix[
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
File ~/anaconda3/lib/python3.9/site-packages/hdbscan/validity.py:371, in validity_index(X, labels, metric, d, per_cluster_scores, mst_raw_dist, verbose, **kwd_args)
356 continue
358 distances_for_mst, core_distances[
359 cluster_id] = distances_between_points(
360 X,
(...)
367 **kwd_args
368 )
370 mst_nodes[cluster_id], mst_edges[cluster_id] = \
--> 371 internal_minimum_spanning_tree(distances_for_mst)
372 density_sparseness[cluster_id] = mst_edges[cluster_id].T[2].max()
374 for i in range(max_cluster_id):
File ~/anaconda3/lib/python3.9/site-packages/hdbscan/validity.py:165, in internal_minimum_spanning_tree(mr_distances)
136 def internal_minimum_spanning_tree(mr_distances):
137 """
138 Compute the 'internal' minimum spanning tree given a matrix of mutual
139 reachability distances. Given a minimum spanning tree the 'internal'
(...)
...
167 for index, row in enumerate(min_span_tree[1:], 1):
File hdbscan/_hdbscan_linkage.pyx:15, in hdbscan._hdbscan_linkage.mst_linkage_core()
ValueError: Buffer dtype mismatch, expected 'double_t' but got 'float'
A quick look at the inputs and their types:
The features:
dtype=float32
shape: (70201, 320)
The archives/clusters (it is label encoded):
shape: (70201,)
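For context, the Cython linkage code in hdbscan expects `np.float64` (a C `double`), so float32 features trip the buffer-dtype check. A minimal sketch of verifying the dtypes before calling `validity_index` (using made-up stand-ins for `feature_vectors` and `archive_labels`):

```python
import numpy as np

# Hypothetical stand-ins for the real feature_vectors / archive_labels.
feature_vectors = np.random.rand(100, 320).astype(np.float32)
archive_labels = np.random.randint(0, 5, size=100)

# The Cython MST routine expects C doubles; float32 input triggers
# "Buffer dtype mismatch, expected 'double_t' but got 'float'".
print(feature_vectors.dtype)                     # float32
print(feature_vectors.astype(np.float64).dtype)  # float64
```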
When I changed the features' dtype to double/float64, I got a different error:
hdbscan/validity.py:33: RuntimeWarning: invalid value encountered in true_divide
result /= distance_matrix.shape[0] - 1
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
File ~/anaconda3/lib/python3.9/site-packages/hdbscan/validity.py:372, in validity_index(X, labels, metric, d, per_cluster_scores, mst_raw_dist, verbose, **kwd_args)
358 distances_for_mst, core_distances[
359 cluster_id] = distances_between_points(
360 X,
(...)
367 **kwd_args
368 )
370 mst_nodes[cluster_id], mst_edges[cluster_id] = \
371 internal_minimum_spanning_tree(distances_for_mst)
--> 372 density_sparseness[cluster_id] = mst_edges[cluster_id].T[2].max()
374 for i in range(max_cluster_id):
376 if np.sum(labels == i) == 0:
File ~/anaconda3/lib/python3.9/site-packages/numpy/core/_methods.py:40, in _amax(a, axis, out, keepdims, initial, where)
38 def _amax(a, axis=None, out=None, keepdims=False,
39 initial=_NoValue, where=True):
---> 40 return umr_maximum(a, axis, None, out, keepdims, initial, where)
ValueError: zero-size array to reduction operation maximum which has no identity
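For what it's worth, this second error means one cluster's internal MST produced an empty edge array, so `.max()` has nothing to reduce over. One plausible cause (an assumption on my part, not confirmed by the traceback alone) is a label with too few members to have any internal edges. Filtering out noise and tiny clusters before the call could be sketched like this:

```python
import numpy as np

# Hypothetical labels; -1 is hdbscan's usual noise label.
labels = np.array([0, 0, 0, 1, 2, 2, -1])

# Keep only points whose cluster has at least 2 members (and drop noise),
# so no cluster yields an empty internal-MST edge array.
valid = labels >= 0
sizes = np.bincount(labels[valid])
big_enough = np.isin(labels, np.where(sizes >= 2)[0]) & valid
print(labels[big_enough])  # [0 0 0 2 2]
```

The same boolean mask would be applied to the feature matrix so rows and labels stay aligned.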
I went through all the related issues and fixes in the repo, but to no avail. Are there any recommendations or fixes?
Upvotes: 1
Views: 199
Reputation: 11
I fixed that issue by converting the NumPy array from float to double. In your case, try:
feature_vectors = feature_vectors.astype('double')
before calling validity_index.
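In NumPy, `'double'` is just an alias for `np.float64`, so either spelling works. A quick check (with a hypothetical feature array):

```python
import numpy as np

feature_vectors = np.zeros((5, 320), dtype=np.float32)  # hypothetical features
feature_vectors = feature_vectors.astype('double')      # same as np.float64

print(feature_vectors.dtype)  # float64
# then: hdbscan.validity.validity_index(feature_vectors, archive_labels)
```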
Upvotes: 0