ValueError: The condensed distance matrix must contain only finite values. in python

Question

I wrote the following code for hierarchical clustering, but I get the following error, can you help me?

# Importing the libraries

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the Mall dataset with pandas

dataset = pd.read_csv("https://raw.githubusercontent.com/akbarhusnoo/Chronic-Kidney-Disease-Prediction/main/chronic_kidney_disease.csv", na_values=["?"])


catCols = dataset.select_dtypes("object").columns
catCols = list(set(catCols))
for i in catCols:
 dataset.replace({i: {'?': np.nan}}, regex=False,inplace=True)

dataset.dropna(how='all')
X = dataset.iloc[:, [3,4]].values

# Using the dendrogram to find the optimal number of clusters

import scipy.cluster.hierarchy as sch
dendrogram = sch.dendrogram(sch.linkage(X, method='ward' ))
plt.title('Dendrogram')
plt.xlabel('C')
plt.ylabel('Euclidean distances')
plt.show()

# Fitting the hierarchical clustering to the mall dataset

from sklearn.cluster import AgglomerativeClustering
hc = AgglomerativeClustering(n_clusters=5, affinity = 'euclidean', linkage = 'ward')
Y_hc = hc.fit_predict(X)

# Visualising the clusters

Dataset: https://raw.githubusercontent.com/akbarhusnoo/Chronic-Kidney-Disease-Prediction/main/chronic_kidney_disease.csv

ValueError                                Traceback (most recent call last)
 in 
     12 
     13 import scipy.cluster.hierarchy as sch
---> 14 dendrogram = sch.dendrogram(sch.linkage(X, method='ward' ))
     15 plt.title('Dendrogram')
     16 plt.xlabel('C')

~\anaconda3\lib\site-packages\scipy\cluster\hierarchy.py in linkage(y, method, metric, optimal_ordering)
   1063 
   1064     if not np.all(np.isfinite(y)):
-> 1065         raise ValueError("The condensed distance matrix must contain only "
   1066                          "finite values.")
   1067 

ValueError: The condensed distance matrix must contain only finite values.

9769953 · Accepted Answer

There are question marks in your input dataset, which cause the affected columns to be read as strings instead of numbers.

You should either convert the question marks to NaNs after reading the CSV, or remove them directly from the input CSV file (leaving an empty cell in the CSV will be interpreted as a NaN, so replacing all ,?, by ,, could very well work).
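As a rough sketch of the two options above: the snippet below uses a small inline CSV as a stand-in for the real file, so the column names and values are made up for illustration. It first shows what happens when the question marks are left in, then converts them after reading, and finally lets read_csv treat them as missing while parsing.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical inline sample standing in for the real CSV;
# the column names and values are invented for illustration.
csv_text = """age,bp,sg
48,80,1.020
7,?,1.020
62,80,?
?,?,?
"""

# Without any special handling, a column containing '?' is read as text.
raw = pd.read_csv(io.StringIO(csv_text))

# Option 1: convert after reading. Replace '?' with NaN, then turn the
# remaining numeric strings into real numbers, column by column.
converted = raw.replace("?", np.nan).apply(pd.to_numeric)

# Option 2: tell read_csv to treat '?' as a missing value while parsing.
parsed = pd.read_csv(io.StringIO(csv_text), na_values=["?"])

print(converted.isna().sum())
print(parsed.dtypes)
```

Either way, the columns should end up numeric with NaN where the question marks were. If the real file contains other placeholder strings, those would need to be added to na_values as well.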

Once you've done that, you can drop the rows with NaNs. Be aware that:

- Some rows have a NaN in only a single column. Use dropna(how='any'), not dropna(how='all'), to make sure such rows are also dropped.
- dropna() does not work in-place by default (as is the case for most operations in current versions of Pandas). Assign the result back to dataset, or use the inplace=True argument.

Thus, use

dataset = dataset.dropna(how='any')

when removing the rows with NaNs.
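To see the difference between the two how values, here is a minimal sketch on a made-up two-column frame (the column names a and b are placeholders, not columns from your dataset). It also repeats the finiteness check that linkage() runs in the traceback above, so you can confirm the data is clean before clustering.

```python
import numpy as np
import pandas as pd
import scipy.cluster.hierarchy as sch

# Hypothetical stand-in for the two columns used for clustering.
dataset = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 8.0, 9.0],
    "b": [1.5, np.nan, 3.0, 8.5, 9.5],
})

# how='all' only drops rows where every column is NaN,
# so no row is removed from this frame.
still_dirty = dataset.dropna(how="all")

# how='any' drops a row as soon as one column is NaN.
# The result is assigned back because dropna() is not in-place by default.
dataset = dataset.dropna(how="any")

X = dataset.values

# Same check that linkage() performs before raising the ValueError.
assert np.all(np.isfinite(X))

# With only finite values left, ward linkage runs without the error.
Z = sch.linkage(X, method="ward")
print(len(still_dirty), X.shape, Z.shape)
```

Keep in mind that dropping every row with a NaN can remove a large share of a dataset with many missing values, so it is worth checking how many rows remain afterwards.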
