Reputation: 4477
I have a pairwise distance dataframe that I've made with pandas:
# Get files
import glob
import itertools

import pandas

one_dimension = glob.glob('*.pdb')
dataframe = []
for pdb_1, pdb_2 in itertools.combinations(one_dimension, 2):
    # get_rmsd (defined elsewhere) returns the rmsd between two pdb files
    entry = {'pdb_1': pdb_1, 'pdb_2': pdb_2, 'rmsd': get_rmsd(pdb_1, pdb_2)}
    dataframe.append(entry)
dataframe = pandas.DataFrame(dataframe)
dataframe
All I want to do is cluster the dataframe so that every cluster contains only pdbs that are within some cutoff of each other (let's say an rmsd of less than 2). I have read that complete linkage is the way to go.
For instance:
So they can all appear in a cluster together. But if a new pdb is considered for the cluster and its distance to any member already in the cluster is > 2, it will be rejected.
I understand that this is a complete linkage with a cutoff.
I have looked into scipy.cluster.hierarchy.linkage, but I'm having an extremely hard time getting the array into the format that linkage expects.
What is the best way to complete this task?
How do I go from my dataframe to something usable by scipy.cluster?
Should I turn it into an R dataframe?
How do I find out which members are in each cluster if I transform the pairwise distances to an array?
I have found this, this, and this question to be similar, and also found this tutorial.
According to the answer by cel, I can get the following:
>>df
and then pivot
pivot_table = df.pivot(index='pdb_1', columns='pdb_2', values='rmsd').fillna(0)
>>pivot_table
Then the data array
piv_arr = pivot_table.to_numpy()  # as_matrix() was removed in pandas 1.0
dist_mat = piv_arr + np.transpose(piv_arr)
>>dist_mat
But I can't apply squareform, because the diagonal entries don't equal 0...
>>>squareform(dist_mat)
and can verify
>>dist_mat.diagonal()
Upvotes: 2
Views: 7174
Reputation: 31349
This might work for you:
These are the imports we need:
import scipy.cluster.hierarchy as hcl
from scipy.spatial.distance import squareform
import pandas as pd
import numpy as np
Let's assume we already calculated the distance matrix and decided to store the upper triangular part of the distance matrix in this format:
data = pd.DataFrame({
    "a": ["a1", "a1", "a2", "a3", "a2", "a1"],
    "b": ["a2", "a3", "a3", "a3", "a2", "a1"],
    "distance": [1, 2, 3, 0, 0, 0]
})
So this is our data frame:
    a   b  distance
0  a1  a2         1
1  a1  a3         2
2  a2  a3         3
3  a3  a3         0
4  a2  a2         0
5  a1  a1         0
Using DataFrame.pivot, we can convert the data frame to a square distance matrix:
data_piv = data.pivot(index="a", columns="b", values="distance").fillna(0)
piv_arr = data_piv.to_numpy()  # as_matrix() was removed in pandas 1.0
dist_mat = piv_arr + np.transpose(piv_arr)
This will give us:
array([[ 0., 1., 2.],
       [ 1., 0., 3.],
       [ 2., 3., 0.]])
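If the data frame only holds the unique pairs (as itertools.combinations produces, with no self-distances), the pivot has different row and column labels, so the array is not aligned and the diagonal is not guaranteed to be 0. One possible way around that, as a sketch only, is to reindex the pivot to the full label set before mirroring it. The `pairs` frame and the `labels` and `upper` names below are made up for illustration, and the sketch assumes each unordered pair appears exactly once:

```python
import numpy as np
import pandas as pd

# Hypothetical input: unique pairs only, no (x, x) rows.
pairs = pd.DataFrame({
    "a": ["a1", "a1", "a2"],
    "b": ["a2", "a3", "a3"],
    "distance": [1.0, 2.0, 3.0],
})

# Every label that appears on either side of a pair.
labels = sorted(set(pairs["a"]) | set(pairs["b"]))

# Pivot, then force the same labels onto rows and columns so the
# result is square; missing entries become 0.
upper = (
    pairs.pivot(index="a", columns="b", values="distance")
    .reindex(index=labels, columns=labels)
    .fillna(0)
    .to_numpy()
)

# Mirror the filled triangle; the diagonal stays 0.
dist_mat = upper + upper.T
print(dist_mat)
```

Keeping `labels` around is useful later, because row i of dist_mat corresponds to labels[i].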
This we can transform into a condensed distance matrix via squareform and feed into the linkage algorithm:
hcl.linkage(squareform(dist_mat))
Which gives us the following linkage matrix:
array([[ 0., 1., 1., 2.],
       [ 2., 3., 2., 3.]])
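Note that linkage defaults to method="single", which is what the matrix above reflects. For the complete-linkage-with-cutoff behaviour asked about in the question, a minimal sketch could pass method="complete" and then cut the tree with fcluster. The cutoff of 2 comes from the question; the `labels` list and the `clusters` dict are illustrative names, and the distance matrix is the small example from above:

```python
import numpy as np
import scipy.cluster.hierarchy as hcl
from scipy.spatial.distance import squareform

# Row/column order of the distance matrix (assumed to match the pivot).
labels = ["a1", "a2", "a3"]
dist_mat = np.array([[0., 1., 2.],
                     [1., 0., 3.],
                     [2., 3., 0.]])

# Complete linkage: two clusters merge at the largest pairwise
# distance between their members.
Z = hcl.linkage(squareform(dist_mat), method="complete")

# Cut the tree so that no flat cluster was merged above the cutoff.
cutoff = 2
cluster_ids = hcl.fcluster(Z, t=cutoff, criterion="distance")

# Map each cluster id back to the labels it contains.
clusters = {}
for label, cid in zip(labels, cluster_ids):
    clusters.setdefault(int(cid), []).append(label)
print(clusters)
```

With criterion="distance", fcluster groups observations whose cophenetic distance is no greater than t, so a pair at exactly the cutoff can still end up together. If a strict "less than 2" is needed, a slightly smaller t would be one option.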
Upvotes: 1