Darius

Reputation: 596

Simple example of biopython clustering

I would like to get a basic understanding of how to use Biopython for clustering genes.

Let's say I have genes that I would like to group. How do I feed them to the algorithm, and how do I specify a cutoff that determines the number and size of the clusters?

I've tried a straightforward approach:

from Bio.Cluster import kcluster
list1 = [
    'ADHAMKCAIROSURBANDJVUGLOBALIZATIONANDURBANFANTASIESPLA', 
    'AGGESTAMKTHEARABSTATEANDNEOLIBERALGLOBALIZATIONTHEARAB', 
    'AGGESTAMKTHEARABSTATEANDNEOLIBERALGLOBALIZATIONTHEARAB', 
    'AGGESTAMKTHEARABSTATEANDNEOLIBERALGLOBALIZATIONTHEARAB'
]
list2 = [Seq(gen, IUPAC.extended_protein) for gen in list1]
clusterid, error, nfound = kcluster(list2)

but it just gave me an error:

Traceback (most recent call last):
  File "./test.py", line 9, in <module>
    clusterid, error, nfound = kcluster(list2)
TypeError: data cannot be converted to needed array.

Upvotes: 2

Views: 2449

Answers (1)

fsimkovic

Reputation: 1128

The kcluster function takes a data matrix as input and not Seq instances.

You need to convert your sequences to a matrix and provide that to the kcluster function.

One way to convert the data to a purely numerical matrix is the numpy.fromstring function, which translates each letter in a sequence to its ASCII code. (Recent NumPy releases deprecate this binary use of fromstring in favor of numpy.frombuffer.)

This creates a 2D array of encoded sequences that the kcluster function recognizes and uses to cluster your sequences, provided all sequences have the same length.
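To make the encoding step concrete, here is a minimal pure-Python sketch of the same letter-to-code translation, using the built-in ord instead of NumPy. The helper name encode_sequences is hypothetical and only for illustration; it is not part of Biopython or NumPy.

```python
def encode_sequences(sequences):
    """Turn equal-length strings into rows of integer character codes.

    Hypothetical helper for illustration; mirrors what the NumPy
    conversion produces, but as plain nested lists.
    """
    lengths = {len(s) for s in sequences}
    if len(lengths) > 1:
        # A rectangular matrix needs every row to have the same width.
        raise ValueError("all sequences must have the same length")
    return [[ord(ch) for ch in s] for s in sequences]


rows = encode_sequences(["ADH", "AGG", "AGG"])
print(rows)  # [[65, 68, 72], [65, 71, 71], [65, 71, 71]]
```

Note that these codes are arbitrary labels: the numeric gap between two letters says nothing about how similar the residues are, so distances computed on this matrix are only a rough proxy.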

>>> from Bio.Cluster import kcluster
>>> import numpy as np
>>> sequences = [
...     'ADHAMKCAIROSURBANDJVUGLOBALIZATIONANDURBANFANTASIESPLA',
...     'AGGESTAMKTHEARABSTATEANDNEOLIBERALGLOBALIZATIONTHEARAB',
...     'AGGESTAMKTHEARABSTATEANDNEOLIBERALGLOBALIZATIONTHEARAB',
...     'AGGESTAMKTHEARABSTATEANDNEOLIBERALGLOBALIZATIONTHEARAB'
... ]
>>> # np.frombuffer replaces the deprecated binary mode of np.fromstring
>>> matrix = np.asarray([np.frombuffer(s.encode("ascii"), dtype=np.uint8) for s in sequences])
>>> clusterid, error, nfound = kcluster(matrix)
>>> print(clusterid)
[1, 0, 0, 0]
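On the cutoff part of the question: kcluster is controlled by a fixed number of clusters (its nclusters argument, which defaults to 2) rather than by a distance threshold. If a threshold is what you want, one alternative is a different technique altogether. The sketch below is not Bio.Cluster; it is a plain-Python single-linkage grouping on Hamming distance, and the names hamming and cluster_by_cutoff are hypothetical. It assumes equal-length sequences.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def cluster_by_cutoff(sequences, cutoff):
    """Single-linkage grouping: any two sequences with at most `cutoff`
    mismatches end up in the same cluster (hypothetical helper)."""
    labels = list(range(len(sequences)))  # every sequence starts alone
    for i in range(len(sequences)):
        for j in range(i + 1, len(sequences)):
            if hamming(sequences[i], sequences[j]) <= cutoff:
                # Merge the cluster of j into the cluster of i.
                old, new = labels[j], labels[i]
                labels = [new if lab == old else lab for lab in labels]
    # Renumber the labels to 0, 1, 2, ... in order of first appearance.
    remap = {}
    return [remap.setdefault(lab, len(remap)) for lab in labels]


sequences = [
    'ADHAMKCAIROSURBANDJVUGLOBALIZATIONANDURBANFANTASIESPLA',
    'AGGESTAMKTHEARABSTATEANDNEOLIBERALGLOBALIZATIONTHEARAB',
    'AGGESTAMKTHEARABSTATEANDNEOLIBERALGLOBALIZATIONTHEARAB',
    'AGGESTAMKTHEARABSTATEANDNEOLIBERALGLOBALIZATIONTHEARAB',
]
print(cluster_by_cutoff(sequences, cutoff=5))  # [0, 1, 1, 1]
```

A larger cutoff merges more sequences into fewer, bigger clusters; a cutoff of 0 groups only identical sequences. The pairwise loop is quadratic in the number of sequences, so this is meant for small inputs.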

Upvotes: 3
