user1603472

Reputation: 1458

How do you access ward/centroid/median clustering in scipy?

When I use scipy.spatial.distance.pdist to create a condensed distance matrix and pass it to linkage() with method='ward', I get this error:

Valid methods when the raw observations are omitted are 'single', 'complete', 'weighted', and 'average'.

The documentation though says that the linkage() function expects a condensed distance matrix. How can I work around this problem?

import numpy as np
import scipy.cluster.hierarchy
import scipy.spatial.distance

foo = np.random.randint(3, size=(10,10))
scipy.spatial.distance.pdist(foo)
scipy.cluster.hierarchy.linkage(foo)
bar = scipy.spatial.distance.pdist(foo)
scipy.cluster.hierarchy.linkage(bar, method='ward')

gives:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/dist-packages/scipy/cluster/hierarchy.py", line 627, in linkage
    raise ValueError("Valid methods when the raw observations are "
ValueError: Valid methods when the raw observations are omitted are 'single', 'complete', 'weighted', and 'average'.

I searched a bit and found this link, which indicates that a few other people have had the same problem, but I was unable to find a workaround for providing the data in a form that scipy will accept.

Upvotes: 3

Views: 2493

Answers (2)

XY.W

Reputation: 104

scipy.cluster.hierarchy.linkage(y, method) returns correct results for the single, complete, average and weighted methods when y is either a condensed distance matrix or a data matrix. For the centroid, median and ward methods, however, y has to be a data matrix (raw observations); in the SciPy version discussed here, passing a distance matrix raises the error shown in the question. I agree that the documentation isn't clear.

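import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

# Load the four numeric columns of the iris data set
inp = np.loadtxt('iris.txt', delimiter=",", usecols=(0,1,2,3))
x = np.asarray(inp)         # data matrix (raw observations)
Y = pdist(x, 'euclidean')   # condensed Euclidean distance matrix
res_linkage = linkage(x, "weighted")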

You can test the code above by passing either x, the data matrix, or Y, the Euclidean distance matrix, to the linkage() function.
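A minimal sketch of that check is given here. It uses random data as a stand-in for iris.txt (an assumption, so that it runs without the file), and the names res_from_data and res_from_dist are only illustrative. For 'weighted', both call styles should agree, since linkage() computes Euclidean distances itself when it is given raw observations.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

# Random stand-in for the iris measurements (assumption: any float data matrix will do)
rng = np.random.default_rng(0)
x = rng.random((20, 4))

# Condensed Euclidean distance matrix, as in the answer above
Y = pdist(x, 'euclidean')

# Same method, two input styles: raw observations vs. condensed distances
res_from_data = linkage(x, 'weighted')
res_from_dist = linkage(Y, 'weighted')

# True if both inputs lead to the same linkage matrix
same = np.allclose(res_from_data, res_from_dist)
print(same)
```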

I also found that, compared to the equivalent implementation in R (the hclust function), scipy.cluster.hierarchy.linkage returns different results for the centroid, median and ward methods. It seems that scipy.cluster.hierarchy.linkage contains some errors when updating the distance between a newly merged cluster and an existing cluster.
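One way to examine the update step by hand is to cluster three points and compare the second merge height with the Ward update formula given in the SciPy linkage() documentation. This is only a sketch: the three points and the variable names are made up for illustration, and it says nothing about how R computes its result.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Three collinear points (hypothetical test data): a=(0,0), b=(1,0), c=(5,0)
pts = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 0.0]])
Z = linkage(pts, method='ward')

# Pairwise Euclidean distances, worked out by hand
d_ab, d_ac, d_bc = 1.0, 5.0, 4.0

# a and b merge first. Ward update for the distance from {a, b} to c:
# sqrt((|c|+|a|)/T * d(c,a)^2 + (|c|+|b|)/T * d(c,b)^2 - |c|/T * d(a,b)^2)
# with |a| = |b| = |c| = 1 and T = |a| + |b| + |c| = 3
T = 3.0
expected = np.sqrt((2.0 / T) * d_ac**2 + (2.0 / T) * d_bc**2 - (1.0 / T) * d_ab**2)

print(Z[0, 2], Z[1, 2], expected)
```

If the second merge height from linkage() differs from the hand-computed value on your installation, that would point at the update step; if they match, the difference from R more likely lies elsewhere, for example in how the input distances are interpreted.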

Upvotes: 1

ali_m

Reputation: 74182

From the docstring:

y : ndarray

A condensed or redundant distance matrix. A condensed distance matrix is a flat array containing the upper triangular of the distance matrix. This is the form that pdist returns. Alternatively, a collection of m observation vectors in n dimensions may be passed as an m by n array.
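To make the "condensed" form concrete, here is a small sketch with made-up random observations (the names obs, condensed and square are illustrative only). pdist returns a flat array of m*(m-1)/2 pairwise distances, and squareform expands it into the full symmetric m by m matrix.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical data: m observations in n dimensions
m, n = 10, 3
rng = np.random.default_rng(0)
obs = rng.random((m, n))

# Condensed form: flat array holding the upper triangle of the distance matrix
condensed = pdist(obs)

# Redundant (square) form: symmetric m-by-m matrix with a zero diagonal
square = squareform(condensed)

print(condensed.shape, square.shape)
```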

Passing in your original (observations x dimensions) array foo seems to work:

scipy.cluster.hierarchy.linkage(foo, method='ward')

gives:

array([[  1.        ,   2.        ,   2.23606798,   2.        ],
       [  5.        ,   8.        ,   2.23606798,   2.        ],
       [  3.        ,   7.        ,   2.64575131,   2.        ],
       [  9.        ,  11.        ,   2.64575131,   3.        ],
       [  0.        ,  10.        ,   3.31662479,   3.        ],
       [ 12.        ,  13.        ,   3.71483512,   5.        ],
       [  6.        ,  14.        ,   4.12310563,   4.        ],
       [  4.        ,  16.        ,   4.17133072,   5.        ],
       [ 15.        ,  17.        ,   5.5136195 ,  10.        ]])

I agree that the documentation for linkage() could do with improvement at the very least.

Upvotes: 2
