Reputation: 302
I wrote the following code for hierarchical clustering, but it raises the error shown below. Can you help me?
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the Mall dataset with pandas
dataset = pd.read_csv("https://raw.githubusercontent.com/akbarhusnoo/Chronic-Kidney-Disease-Prediction/main/chronic_kidney_disease.csv", na_values=["?"])
catCols = dataset.select_dtypes("object").columns
catCols = list(set(catCols))
for i in catCols:
    dataset.replace({i: {'?': np.nan}}, regex=False, inplace=True)
dataset.dropna(how='all')
X = dataset.iloc[:, [3,4]].values
# Using the dendrogram to find the optimal number of clusters
import scipy.cluster.hierarchy as sch
dendrogram = sch.dendrogram(sch.linkage(X, method='ward' ))
plt.title('Dendrogram')
plt.xlabel('C')
plt.ylabel('Euclidean distances')
plt.show()
# Fitting the hierarchical clustering to the mall dataset
from sklearn.cluster import AgglomerativeClustering
hc = AgglomerativeClustering(n_clusters=5, affinity = 'euclidean', linkage = 'ward')
Y_hc = hc.fit_predict(X)
# Visualising the clusters
ValueError Traceback (most recent call last)
<ipython-input-30-2c6a60c0a6d0> in <module>
12
13 import scipy.cluster.hierarchy as sch
---> 14 dendrogram = sch.dendrogram(sch.linkage(X, method='ward' ))
15 plt.title('Dendrogram')
16 plt.xlabel('C')
~\anaconda3\lib\site-packages\scipy\cluster\hierarchy.py in linkage(y, method, metric, optimal_ordering)
1063
1064 if not np.all(np.isfinite(y)):
-> 1065 raise ValueError("The condensed distance matrix must contain only "
1066 "finite values.")
1067
ValueError: The condensed distance matrix must contain only finite values.
Upvotes: 0
Views: 19820
Reputation: 11
Try using a different linkage method instead of 'ward' (e.g. 'single', 'complete', 'average' or 'weighted'):
---> 14 dendrogram = sch.dendrogram(sch.linkage(X, method='ward'))
The Ward calculation might be resulting in infs or nans...
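One way to see whether non-finite values are already present in the input, before any linkage method runs, is to inspect X directly. This is only a minimal sketch: the small array below is a hypothetical stand-in for X = dataset.iloc[:, [3,4]].values, and it assumes X has a float dtype.

```python
import numpy as np

# Hypothetical stand-in for X = dataset.iloc[:, [3, 4]].values (float dtype assumed)
X = np.array([[1.020, 1.0],
              [np.nan, 4.0],
              [1.010, np.nan],
              [1.005, 3.0]])

# True wherever an entry is NaN or +/-inf
non_finite = ~np.isfinite(X)

print("non-finite entries:", int(non_finite.sum()))
print("rows affected:", np.flatnonzero(non_finite.any(axis=1)))
```

If the count is non-zero, the line raising the error in the traceback (the np.isfinite check in linkage) would be triggered by the input itself.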
Upvotes: 1
Reputation: 12201
There are question marks in your input dataset, which result in the dataset values being read/interpreted as strings instead of integers.
You should either convert the question marks to NaNs after reading the CSV, or remove them directly from the input CSV file (leaving an empty cell in the CSV will be interpreted as a NaN, so replacing all ",?," by ",," could very well work).
Once you've done that, you can drop rows with NaNs. Be aware that you need dropna(how='any'), not dropna(how='all'): how='all' only drops rows in which every value is NaN, whereas how='any' also drops rows with just some NaNs.
dropna() by default does not work in-place (as is the default for most operations in Pandas in current versions). Assign the result back to dataset, or use the inplace=True argument. Thus, use dataset = dataset.dropna(how='any') when removing the rows with NaNs.
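Put together, the steps might look like the sketch below. The inline CSV is a made-up miniature stand-in for the real file (its column names and values are assumptions for illustration only), and the column indices are chosen to fit that toy data, not the original dataset.

```python
import io

import numpy as np
import pandas as pd
import scipy.cluster.hierarchy as sch

# Hypothetical miniature CSV standing in for the real file; '?' marks missing cells
csv_text = """age,bp,sg,al
48,80,1.020,1
7,50,?,4
62,80,1.010,?
48,70,1.005,4
51,80,1.010,2
"""

# na_values=["?"] turns the question marks into NaN while parsing
dataset = pd.read_csv(io.StringIO(csv_text), na_values=["?"])

# dropna returns a new DataFrame, so the result is assigned back
dataset = dataset.dropna(how="any")

# Two numeric columns of the toy data, picked only for illustration
X = dataset.iloc[:, [2, 3]].values

# With only finite values left, linkage passes its np.isfinite check
Z = sch.linkage(X, method="ward")
print(X.shape, Z.shape)
```

Dropping rows only on the columns actually used for clustering, e.g. dataset.dropna(subset=[...], how='any'), would keep more rows; which columns to list there depends on your data.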
Upvotes: 2