Reputation:
How can I run hierarchical clustering on a correlation matrix in scipy/numpy?
I have a matrix of 100 rows by 9 columns, and I'd like to hierarchically cluster by correlations of each entry across the 9 conditions. I'd like to use 1 - Pearson correlation as the distance for clustering. Assuming I have a numpy array X that contains the 100 x 9 matrix, how can I do this?
I tried using hcluster, based on this example:
Y=pdist(X, 'seuclidean')
Z=linkage(Y, 'single')
dendrogram(Z, color_threshold=0)
However, pdist is not what I want, since that's a euclidean distance. Any ideas?
thanks.
Upvotes: 13
Views: 14586
Reputation: 379
I find it helpful to perform and visualize the hierarchical clustering with seaborn's clustermap (which uses scipy underneath for the clustering), after computing the distances with pdist using 'correlation' as the metric:
import seaborn as sns
from scipy.cluster.hierarchy import dendrogram
from scipy.spatial.distance import pdist, squareform
D = squareform(pdist(X.T, 'correlation'))
h = sns.clustermap(D, cmap='Reds')
You can also recover the corresponding linkage matrix and plot the dendrogram:
Z = h.dendrogram_col.linkage
dendrogram(Z, color_threshold=0)
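One thing worth keeping in mind with the snippet above: pdist compares rows, so X.T compares the 9 conditions, whereas passing X compares the 100 entries the question asks about. The following is a minimal sketch of that difference using scipy only. The random array is an assumption standing in for the real X, and no_plot=True is used just to read the leaf order without drawing anything.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist, squareform

# Assumption: random data standing in for the real 100 x 9 matrix X.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 9))

# pdist works row-wise: X gives entry-vs-entry distances,
# X.T gives condition-vs-condition distances.
D_rows = squareform(pdist(X, 'correlation'))    # shape (100, 100)
D_cols = squareform(pdist(X.T, 'correlation'))  # shape (9, 9)

# Cluster the 100 entries and read the dendrogram leaf order without plotting.
Z_rows = linkage(pdist(X, 'correlation'), 'single')
leaf_order = dendrogram(Z_rows, no_plot=True)['leaves']
```

Dropping no_plot=True draws the dendrogram as in the snippets above (this needs matplotlib).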
Upvotes: 1
Reputation: 47072
Just change the metric to correlation so that the first line becomes:
Y=pdist(X, 'correlation')
However, I believe that the code can be simplified to just:
Z=linkage(X, 'single', 'correlation')
dendrogram(Z, color_threshold=0)
because linkage will take care of the pdist for you.
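As a quick check of both claims, here is a small sketch. It assumes random data in place of the real X, and the choice of 3 flat clusters at the end is purely illustrative. It shows that the 'correlation' metric matches 1 - Pearson correlation between rows, and that the one-step and two-step calls produce the same linkage matrix.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist, squareform

# Assumption: random data standing in for the real 100 x 9 matrix X.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 9))

# 'correlation' distance between rows should equal 1 - Pearson correlation.
Y = pdist(X, 'correlation')
manual = 1.0 - np.corrcoef(X)  # np.corrcoef treats each row as a variable
same_distance = np.allclose(squareform(Y), manual)

# Two-step (pdist, then linkage) versus one-step (linkage on the raw data).
Z_two_step = linkage(Y, 'single')
Z_one_step = linkage(X, 'single', 'correlation')
same_linkage = np.allclose(Z_two_step, Z_one_step)

# Illustrative only: cut the tree into at most 3 flat clusters.
labels = fcluster(Z_one_step, t=3, criterion='maxclust')
```

Note that linkage only computes the distances itself when it is given the 2-D observation matrix; a 1-D input is treated as an already condensed distance vector such as the output of pdist.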
Upvotes: 14