Reputation: 25
I want to build a consensus sequence from several sequences in python and I'm looking for the most efficient / most pythonic way to achieve this.
I have a list of strings like this:
sequences = ["ACTAG", "-TTCG", "CTTAG"]
I furthermore have an alphabet like this:
alphabet = ["A", "C", "G", "T"]
and a position frequency matrix like this:
[A C G T]
1 1 0 0
0 1 0 2
0 0 0 3
2 1 0 0
0 0 3 0
If a character occurrs the most at a position, this character is taken for the consensus sequence.
Additionally, when 2 or more characters have the same occurrences for the same position there are additional characters (in this example at position 0 => A or C = M, see IUPAC Codes)
The expected consensus sequence for my example is therefore "MTTAG".
EDIT:
What is the most efficient / most pythonic way to get this consensus sequence based on the given alphabet and position frequency matrix?
Upvotes: 1
Views: 1254
Reputation: 3599
If you already have the position frequency matrix, you could process it as a pandas DataFrame. I chose to orient it such that the alphabet is the index (note the transpose
call at the end):
freq = pd.DataFrame([[1, 1, 0, 0], [0, 1, 0, 2], [0, 0, 0, 3], [2, 1, 0, 0], [0, 0, 3, 0]], columns=['A', 'C', 'G', 'T']).transpose()
gives
0 1 2 3 4
A 1 0 0 2 0
C 1 1 0 1 0
G 0 0 0 0 3
T 0 2 3 0 0
You'll want to look only at the most common nucleotides:
most_common = freq[freq == freq.max(axis=0)]
gives
0 1 2 3 4
A 1.0 NaN NaN 2.0 NaN
C 1.0 NaN NaN NaN NaN
G NaN NaN NaN NaN 3.0
T NaN 2.0 3.0 NaN NaN
Then create a function that determines the consensus from a single column of the above matrix, based on IUPAC codes:
codes = {
'A': 'A', 'C': 'C', 'G': 'G', 'T': 'T',
'AG': 'R', 'CT': 'Y', 'CG': 'S', 'AT': 'W', 'GT': 'K', 'AC': 'M',
'CGT': 'B', 'AGT': 'D', 'ACT': 'H', 'ACG': 'V',
'ACGT': 'N'
}
def freq_to_code(pos):
top_nucs = pos.dropna().index
key = ''.join(sorted(top_nucs))
return codes[key]
Apply that function to each column and form a string to get the final result:
consensus = most_common.apply(freq_to_code, axis=0)
print(''.join(consensus))
gives MTTAG
Upvotes: 3