jwillis0720

Reputation: 4477

Hierarchical clustering of a precomputed pairwise distance matrix

I have a pairwise distance dataframe that I've made with pandas:

#Get files
import glob
import itertools
one_dimension = glob.glob('*.pdb')

dataframe = []
for combo in itertools.combinations(one_dimension,2):
    pdb_1 = combo[0]
    pdb_2 = combo[1]
    entry = {'pdb_1': pdb_1, 'pdb_2': pdb_2, 'rmsd': get_rmsd(pdb_1, pdb_2)}
    dataframe.append(entry)

import pandas
dataframe = pandas.DataFrame(dataframe)
dataframe


All I want to do is cluster the pdbs so that every pair of pdbs within a cluster has an RMSD below some cutoff (let's say 2). I have read that complete linkage is the way to go.

For instance:

  1. pdb_1, pdb_2 have an RMSD of 1.56
  2. pdb_3, pdb_2 have an RMSD of 1.03
  3. pdb_3, pdb_1 have an RMSD of 1.60

So they can all appear in a cluster together. But if a new pdb is a candidate for the cluster and its RMSD to any member already in the cluster is > 2, it will be rejected.

I understand that this is a complete linkage with a cutoff.

I have looked into scipy.cluster.hierarchy.linkage, but I'm having an extremely hard time formatting the array to enter into the linkage.
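For reference, here is a minimal sketch of the input format that linkage accepts. It assumes the pairs are generated with itertools.combinations over a fixed list of names, in which case the distances are already in the order of SciPy's condensed distance vector and no pivoting is needed. The names and the RMSD values in toy_rmsd are hypothetical stand-ins for the real files and for get_rmsd.

```python
import itertools

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# Hypothetical names and RMSD values, standing in for the real files
# and for the get_rmsd() helper.
names = ["pdb_a", "pdb_b", "pdb_c"]
toy_rmsd = {
    ("pdb_a", "pdb_b"): 1.5,
    ("pdb_a", "pdb_c"): 2.5,
    ("pdb_b", "pdb_c"): 1.0,
}

# combinations() yields (0,1), (0,2), (1,2), ... which is the same order
# as the condensed (upper-triangle, row by row) distance vector.
condensed = np.array(
    [toy_rmsd[pair] for pair in itertools.combinations(names, 2)]
)

# squareform() converts between the condensed vector and the square matrix.
square = squareform(condensed)

# linkage() takes the condensed vector directly.
Z = linkage(condensed, method="complete")
```

The condensed vector has n(n-1)/2 entries for n items, so three names give three distances.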

I have found this, this, and this question to be similar, and I also found this tutorial.

UPDATE

According to the answer by cel, I can get the following:

>>df


and then pivot

 pivot_table = df.pivot(index='pdb_1', columns='pdb_2', values='rmsd').fillna(0)
 >>pivot_table


Then the data array

piv_arr = pivot_table.to_numpy()
dist_mat = piv_arr + np.transpose(piv_arr)
>>dist_mat


But I can't apply squareform, as the diagonal entries don't equal 0...

>>>squareform(dist_mat)


and can verify

>>dist_mat.diagonal()


Upvotes: 2

Views: 7174

Answers (1)

cel

Reputation: 31349

This might work for you:

These are the imports we need:

import scipy.cluster.hierarchy as hcl
from scipy.spatial.distance import squareform
import pandas as pd
import numpy as np

Let's assume we already calculated the distance matrix and decided to store its upper triangular part, including the zero diagonal (the self-pairs), in this format:

data = pd.DataFrame({
    "a": ["a1", "a1", "a2", "a3", "a2", "a1"],
    "b": ["a2", "a3", "a3", "a3", "a2", "a1"],
    "distance": [1,2,3, 0, 0, 0]
})

So this is our data frame:

    a   b  distance
0  a1  a2         1
1  a1  a3         2
2  a2  a3         3
3  a3  a3         0
4  a2  a2         0
5  a1  a1         0

Using DataFrame.pivot, we can convert the data frame to a square distance matrix:

data_piv = data.pivot(index="a", columns="b", values="distance").fillna(0)
piv_arr = data_piv.to_numpy()  # as_matrix() was removed in newer pandas versions
dist_mat = piv_arr + np.transpose(piv_arr)

This will give us:

array([[ 0.,  1.,  2.],
       [ 1.,  0.,  3.],
       [ 2.,  3.,  0.]])
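The pivot above comes out square only because the data frame includes the self-pairs (a1, a1), (a2, a2) and (a3, a3). If the data frame holds just the pairs from itertools.combinations, the first label never appears as a column and the last never appears as a row, so the pivot is shifted and the diagonal of the summed matrix is not zero. A minimal sketch of one way around this, assuming the full label list is known, is to reindex the pivot against that list before filling:

```python
import itertools

import numpy as np
import pandas as pd
from scipy.spatial.distance import squareform

# Assumed: the complete, ordered list of labels is available.
labels = ["a1", "a2", "a3"]

# Only the off-diagonal pairs, as itertools.combinations would produce them.
df = pd.DataFrame(list(itertools.combinations(labels, 2)), columns=["a", "b"])
df["distance"] = [1.0, 2.0, 3.0]

# Without self-pairs, this pivot has rows a1, a2 and columns a2, a3.
piv = df.pivot(index="a", columns="b", values="distance")

# Force the same labels on both axes, then fill the missing cells with 0.
full = piv.reindex(index=labels, columns=labels).fillna(0)

# Mirror the upper triangle to get a symmetric matrix with a zero diagonal.
arr = full.to_numpy()
dist_mat = arr + arr.T

# squareform() now accepts the matrix and returns the condensed vector.
condensed = squareform(dist_mat)
```

After the reindex, both axes carry the same three labels, so the sum with the transpose is a valid distance matrix.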

We can transform this into a condensed distance matrix via squareform and feed it into the linkage algorithm. Note that linkage defaults to method='single'; pass method='complete' for complete linkage:

hcl.linkage(squareform(dist_mat))

This gives us the following linkage matrix:

array([[ 0.,  1.,  1.,  2.],
       [ 2.,  3.,  2.,  3.]])
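To get flat clusters under a cutoff, as the question asks, one option is to build the linkage with method="complete" and cut the tree with fcluster. The following is a sketch on the same toy matrix; the cutoff of 2 is taken from the question and should be adjusted to the real data. With complete linkage, the height of each merge is the largest pairwise distance between the two merged clusters, so every pair inside a flat cluster is no farther apart than the cutoff. Note that criterion="distance" keeps merges at or below t, not strictly below it.

```python
import numpy as np
import scipy.cluster.hierarchy as hcl
from scipy.spatial.distance import squareform

# Same toy distance matrix as above.
dist_mat = np.array([
    [0.0, 1.0, 2.0],
    [1.0, 0.0, 3.0],
    [2.0, 3.0, 0.0],
])

# Complete linkage: the distance between two clusters is the maximum
# pairwise distance between their members.
Z = hcl.linkage(squareform(dist_mat), method="complete")

# Cut the tree so that no merge above the cutoff is kept.
cutoff = 2  # assumed value, taken from the question
cluster_ids = hcl.fcluster(Z, t=cutoff, criterion="distance")

# cluster_ids[i] is the flat cluster label of the i-th item (a1, a2, a3).
print(cluster_ids)
```

Here a1 and a2 (distance 1) end up in the same cluster, while a3 stays on its own because its distance to a2 is 3.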

Upvotes: 1
