Reputation: 596
I would like to get a basic understanding of how to use Biopython for clustering genes.
Let's say I have genes that I would like to group. How do I feed them to the algorithm, and how do I set a cutoff that determines the size and number of clusters?
I've tried a straightforward approach:
from Bio.Cluster import kcluster
list1 = [
'ADHAMKCAIROSURBANDJVUGLOBALIZATIONANDURBANFANTASIESPLA',
'AGGESTAMKTHEARABSTATEANDNEOLIBERALGLOBALIZATIONTHEARAB',
'AGGESTAMKTHEARABSTATEANDNEOLIBERALGLOBALIZATIONTHEARAB',
'AGGESTAMKTHEARABSTATEANDNEOLIBERALGLOBALIZATIONTHEARAB'
]
list2 = [Seq(gen, IUPAC.extended_protein) for gen in list1]
clusterid, error, nfound = kcluster(list2)
but it just gave me an error:
Traceback (most recent call last):
  File "./test.py", line 9, in <module>
    clusterid, error, nfound = kcluster(list2)
TypeError: data cannot be converted to needed array.
Upvotes: 2
Views: 2449
Reputation: 1128
The kcluster function takes a numeric data matrix as input, not Seq instances.
You need to convert your sequences to a matrix and pass that to the kcluster function.
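One way of converting the data to a matrix containing only numerical elements is the numpy.fromstring function (deprecated for this use in recent NumPy releases in favor of numpy.frombuffer). It translates each letter in a sequence to its ASCII code.
This creates a 2D array of encoded sequences, one row per sequence, that the kcluster function recognizes and uses to cluster your sequences. Note that this only works when all sequences have the same length; otherwise the rows do not form a rectangular matrix.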
>>> from Bio.Cluster import kcluster
>>> import numpy as np
>>> sequences = [
... 'ADHAMKCAIROSURBANDJVUGLOBALIZATIONANDURBANFANTASIESPLA',
... 'AGGESTAMKTHEARABSTATEANDNEOLIBERALGLOBALIZATIONTHEARAB',
... 'AGGESTAMKTHEARABSTATEANDNEOLIBERALGLOBALIZATIONTHEARAB',
... 'AGGESTAMKTHEARABSTATEANDNEOLIBERALGLOBALIZATIONTHEARAB'
... ]
>>> # np.frombuffer on the ASCII bytes replaces the deprecated np.fromstring call
>>> matrix = np.asarray([np.frombuffer(s.encode("ascii"), dtype=np.uint8) for s in sequences])
>>> clusterid, error, nfound = kcluster(matrix)
>>> print(clusterid)
[1, 0, 0, 0]
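On the cutoff part of the question: kcluster does not take a distance cutoff. The number of groups is fixed up front through its nclusters argument (default 2), and npass controls how many times the randomly initialised k-means run is repeated. The following is only a minimal sketch under those assumptions, reusing the four sequences from the question; the ord-based encoding is just one possible way to build the matrix and is not biologically meaningful.

```python
import numpy as np
from Bio.Cluster import kcluster

sequences = [
    'ADHAMKCAIROSURBANDJVUGLOBALIZATIONANDURBANFANTASIESPLA',
    'AGGESTAMKTHEARABSTATEANDNEOLIBERALGLOBALIZATIONTHEARAB',
    'AGGESTAMKTHEARABSTATEANDNEOLIBERALGLOBALIZATIONTHEARAB',
    'AGGESTAMKTHEARABSTATEANDNEOLIBERALGLOBALIZATIONTHEARAB',
]

# Encode each equal-length sequence as one row of ASCII codes.
matrix = np.array([[ord(ch) for ch in s] for s in sequences], dtype=float)

# nclusters sets how many groups are produced; npass repeats the
# randomly initialised k-means run and keeps the best solution found.
clusterid, error, nfound = kcluster(matrix, nclusters=2, npass=10)

# Cluster labels are arbitrary (0 and 1 may be swapped between runs),
# so compare group membership rather than the label values themselves.
identical_together = clusterid[1] == clusterid[2] == clusterid[3]
print(clusterid, error, nfound, identical_together)
```

Here error is the within-cluster sum of distances for the best solution and nfound is how many of the npass runs reached it. Raising nclusters splits the data into more, smaller groups.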
Upvotes: 3