Reputation: 1458
When I use scipy.spatial.distance.pdist to create a condensed distance matrix and pass it to ward, I get this error:
Valid methods when the raw observations are omitted are 'single', 'complete', 'weighted', and 'average'.
The documentation, though, says that the linkage() function expects a condensed distance matrix. How can I work around this problem?
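import numpy as np
import scipy.cluster.hierarchy
import scipy.spatial.distance

foo = np.random.randint(3, size=(10,10))
scipy.spatial.distance.pdist(foo)
scipy.cluster.hierarchy.linkage(foo)  # raw observations, default method
bar = scipy.spatial.distance.pdist(foo)  # condensed distance matrix
scipy.cluster.hierarchy.linkage(bar, method='ward')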
gives:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/dist-packages/scipy/cluster/hierarchy.py", line 627, in linkage
    raise ValueError("Valid methods when the raw observations are "
ValueError: Valid methods when the raw observations are omitted are 'single', 'complete', 'weighted', and 'average'.
I searched a bit and found this link, indicating that a few other people have the problem, but I was unable to find a workaround to provide the data in a form that scipy will accept.
Upvotes: 3
Views: 2493
Reputation: 104
scipy.cluster.hierarchy.linkage(y, method)
returns correct results for the single, complete, average, and weighted methods when y is either a distance matrix or a data matrix. For the centroid, median, and ward methods, however, y has to be a data matrix; an error occurs if y is a distance matrix. I agree that the documentation isn't clear.
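import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

# Load the first four columns of iris.txt as the data matrix
x = np.loadtxt('iris.txt', delimiter=",", usecols=(0, 1, 2, 3))
Y = pdist(x, 'euclidean')  # condensed Euclidean distance matrix
res_linkage = linkage(x, "weighted")  # or: linkage(Y, "weighted")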
You can test the code above by passing either x, a data matrix, or Y, a Euclidean distance matrix, to the linkage() function.
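As a minimal sketch of that test, the snippet below uses a small random data matrix in place of iris.txt (the synthetic data and the names Z_from_data, Z_from_dist, and same are assumptions for illustration only) and checks whether the two kinds of input give the same linkage for the weighted method:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

# Stand-in for the iris data: 10 observations with 4 features each (assumption)
rng = np.random.RandomState(0)
x = rng.rand(10, 4)

# Condensed Euclidean distance matrix computed from the same observations
Y = pdist(x, 'euclidean')

# Cluster once from the data matrix and once from the distance matrix
Z_from_data = linkage(x, "weighted")
Z_from_dist = linkage(Y, "weighted")

# Compare the two linkage matrices numerically
same = np.allclose(Z_from_data, Z_from_dist)
print("weighted linkage agrees for both inputs:", same)
```

Swapping "weighted" for "centroid", "median", or "ward" in the second call is the case that raised the error on the SciPy version discussed here.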
I also found that, compared with the equivalent implementation in R (the hclust function), scipy.cluster.hierarchy.linkage returns different results for the centroid, median, and ward methods. It seems that scipy.cluster.hierarchy.linkage contains some errors when updating the distance between a newly merged cluster and an existing cluster.
Upvotes: 1
Reputation: 74182
From the docstring:
y : ndarray
A condensed or redundant distance matrix. A condensed distance matrix is a flat array containing the upper triangular of the distance matrix. This is the form that pdist returns. Alternatively, a collection of m observation vectors in n dimensions may be passed as an m by n array.
Passing in your original (observations x dimensions) array foo
seems to work:
scipy.cluster.hierarchy.linkage(foo, method='ward')
gives:
array([[ 1. , 2. , 2.23606798, 2. ],
[ 5. , 8. , 2.23606798, 2. ],
[ 3. , 7. , 2.64575131, 2. ],
[ 9. , 11. , 2.64575131, 3. ],
[ 0. , 10. , 3.31662479, 3. ],
[ 12. , 13. , 3.71483512, 5. ],
[ 6. , 14. , 4.12310563, 4. ],
[ 4. , 16. , 4.17133072, 5. ],
[ 15. , 17. , 5.5136195 , 10. ]])
I agree that the documentation for linkage()
could do with improvement, at the very least.
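For reference, here is a self-contained sketch of that workaround. It assumes a seeded random array in place of the asker's foo, and the names Z and Z_condensed are illustrative. Whether the condensed call raises the ValueError from the question or succeeds may depend on the installed SciPy version, so the sketch handles both cases rather than assuming one:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

# Seeded stand-in for the asker's array: 10 observations in 10 dimensions
rng = np.random.RandomState(0)
foo = rng.randint(3, size=(10, 10))

# Workaround: give linkage() the raw observations when using Ward's method
Z = linkage(foo, method='ward')
print(Z.shape)  # one row per merge: (m - 1, 4) for m observations

# The condensed form has m * (m - 1) / 2 entries (45 for m = 10)
bar = pdist(foo)

# Some SciPy versions reject condensed input for 'ward'; catch that case
try:
    Z_condensed = linkage(bar, method='ward')
except ValueError as exc:
    Z_condensed = None
    print("condensed input rejected:", exc)
```

Each row of Z holds the two cluster indices being merged, the distance between them, and the size of the new cluster, which is the layout of the array shown above.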
Upvotes: 2