Aephir

Reputation: 167

In Python (pandas.DataFrame), is there an easy/efficient way to create all possible combinations of one column from each index, scored by value?

It was difficult to describe well in the title, so let me give an example. I have the following DataFrame:

    A   C   D   E   F   G   H   I   K   L   M   N   P   Q   R   S   T   V   W   Y   X
0   0   0   0   0   0   0   0   0   0   0   0   0   0   102 0   0   0   0   0   0   4
1   12  0   0   0   0   0   0   1   0   79  0   0   0   0   0   0   2   8   0   0   4
2   0   0   0   0   2   0   37  0   0   0   0   3   1   0   2   0   0   0   0   57  4
3   3   0   1   55  0   0   0   6   2   0   1   3   0   0   0   2   8   18  0   0   7
4   5   0   0   0   0   0   0   0   1   0   0   77  0   0   0   6   13  0   0   0   4
5   0   0   0   0   0   0   0   0   102 0   0   0   0   0   0   0   0   0   0   0   4
6   25  0   0   0   0   0   0   0   0   0   0   0   52  0   0   18  7   0   0   0   4
7   0   0   0   0   0   0   0   0   0   0   0   0   0   0   102 0   0   0   0   0   4
8   0   0   0   0   0   0   0   0   0   0   0   0   0   1   101 0   0   0   0   0   4
9   0   0   0   0   0   0   0   0   0   0   0   0   102 0   0   0   0   0   0   0   4
10  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   102 4
11  0   0   0   0   0   0   0   102 0   0   0   0   0   0   0   0   0   0   0   0   4
12  0   0   0   0   0   0   0   0   0   102 0   0   0   0   0   0   0   0   0   0   4

where the index is the position in the sequence and the columns are amino acids (with X denoting gaps). Each value is a probability score.

What I want is to get a series/dictionary/dataframe/whatever with each possible sequence and a corresponding "total score" for the sequence. In this case, that means every possible 13-letter combination of the column headers, together with its total score (the values added up). So in sequences where Q is the first letter, add 102 to the total score; if T is at index 3, add 8; and so on. For this table, QLYENKPRRPYIL is the combination that gives the highest total score (1135), number 2 is QLHENKPRRPYIL with a total score of 1115, etc.
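To make the scoring rule concrete, here is a minimal sketch that rebuilds the table above and scores a single sequence. The `rows` list, the `score` helper and the variable names are illustrative assumptions, not part of the original post; only the non-zero cells of each row are typed in and the rest are filled with 0.

```python
import numpy as np
import pandas as pd

cols = list("ACDEFGHIKLMNPQRSTVWYX")  # column order of the table; X = gap

# Non-zero cells of each row of the table, one dict per position (0..12).
rows = [
    {"Q": 102, "X": 4},
    {"A": 12, "I": 1, "L": 79, "T": 2, "V": 8, "X": 4},
    {"F": 2, "H": 37, "N": 3, "P": 1, "R": 2, "Y": 57, "X": 4},
    {"A": 3, "D": 1, "E": 55, "I": 6, "K": 2, "M": 1, "N": 3,
     "S": 2, "T": 8, "V": 18, "X": 7},
    {"A": 5, "K": 1, "N": 77, "S": 6, "T": 13, "X": 4},
    {"K": 102, "X": 4},
    {"A": 25, "P": 52, "S": 18, "T": 7, "X": 4},
    {"R": 102, "X": 4},
    {"Q": 1, "R": 101, "X": 4},
    {"P": 102, "X": 4},
    {"Y": 102, "X": 4},
    {"I": 102, "X": 4},
    {"L": 102, "X": 4},
]
df = pd.DataFrame(rows, columns=cols).fillna(0).astype(int)


def score(seq: str) -> int:
    """Add up df[position, letter] for every letter of seq (hypothetical helper)."""
    col_idx = df.columns.get_indexer(list(seq))  # letter -> column number
    return int(df.to_numpy()[np.arange(len(seq)), col_idx].sum())


# The single best sequence is just the highest-scoring letter in each row.
best = "".join(df.idxmax(axis=1))

print(best, score(best))       # QLYENKPRRPYIL 1135
print(score("QLHENKPRRPYIL"))  # 1115
```

The sketch assumes the sequence has one letter per row and that every letter is a column header; `get_indexer` returns -1 for unknown letters, which would silently pick the last column.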

I can do this with a bunch of loops and conditions (calculating the difference between the highest, next highest, etc. values in each row), but I assume this is likely the worst of several possible ways to do it.

So, is there a more efficient way to do this, using e.g. a pandas or numpy method I haven't been able to find (or at least figure out how to use for this)?

Upvotes: 1

Views: 100

Answers (1)

yann ziselman

Reputation: 2002

I would do something like what I show in the code below. Sorry for the oversimplified example; a bigger one would be too difficult for me to visualize. Instead of the amino acid names, I use their indices, but you can quickly translate a peptide back to letters by indexing into a list of their names.

import numpy as np

np.random.seed(0)
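LENGTH = 2  # the length of each peptide (number of positions)
AA_NUM = 3  # the number of unique amino acids (including gap)
# df is a plain NumPy array here: rows are positions, columns are amino acids.
# With a real DataFrame, df.to_numpy() gives the same layout.
df = np.random.randint(0, 10, (LENGTH, AA_NUM))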
print(f'df = \n{df}')

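# create all possible peptides of length LENGTH:
# grids holds one row of candidate indices [0, ..., AA_NUM-1] per position
grids   = (np.arange(AA_NUM*LENGTH)%AA_NUM).reshape(LENGTH, AA_NUM)
# meshgrid builds one index array per position; stacking puts positions on axis 0
allpeps = np.stack(np.meshgrid(*grids))
# move the position axis to the end so the last axis reads as one peptide
allpeps = allpeps.transpose(*np.arange(1, LENGTH+1), 0)

# flatten to one row per peptide: shape (AA_NUM**LENGTH, LENGTH)
allpeps = allpeps.reshape(-1, LENGTH)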
print(f'allpeps = \n{allpeps}')

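# compute the score of each peptide:
# length_iter repeats 0..LENGTH-1, i.e. the position of every entry of allpeps
length_iter = np.arange(allpeps.size)%LENGTH
# look up df[position, amino acid] for every entry, then sum per peptide
scores = df[length_iter, allpeps.flatten()].reshape(-1, LENGTH).sum(1)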
print(f'scores = \n{scores}')

# sort the peptides by score:
order = np.argsort(scores)
print(f'order = \n{order}')
print(f'worst peptide: = {allpeps[order[0]]}; score: {scores[order[0]]}')
print(f'best peptide: = {allpeps[order[-1]]}; score: {scores[order[-1]]}')

output:

df = 
[[5 0 3]
 [3 7 9]]
allpeps = 
[[0 0]
 [1 0]
 [2 0]
 [0 1]
 [1 1]
 [2 1]
 [0 2]
 [1 2]
 [2 2]]
scores = 
[ 8  3  6 12  7 10 14  9 12]
order = 
[1 2 4 0 7 5 3 8 6]
worst peptide: = [1 0]; score: 3
best peptide: = [0 2]; score: 14
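
The enumeration above builds AA_NUM**LENGTH rows, which is fine for a toy case but not for the table in the question: 21 columns and 13 positions give 21**13, roughly 1.5e17 peptides. If only the few best peptides are needed, one alternative (a different technique from the meshgrid approach, sketched here under that assumption) is to walk through the positions and keep only the k best partial peptides at each step. Because the total is a plain sum over positions, the k best full peptides always extend the k best prefixes. The function name `top_k_peptides` and the letters in `names` are hypothetical; the sketch also shows how indices map back to names.

```python
import heapq

import numpy as np


def top_k_peptides(scores, names, k):
    """Return the k best (total, peptide) pairs, highest total first.

    scores: 2-D array, rows are positions, columns are amino acids.
    names:  one letter per column, used to decode indices into a string.
    """
    beam = [(0, "")]  # best partial peptides found so far
    for row in scores:  # extend every kept prefix by one position
        candidates = (
            (total + int(value), pep + name)
            for total, pep in beam
            for value, name in zip(row, names)
        )
        beam = heapq.nlargest(k, candidates)  # keep only the k best prefixes
    return beam


scores = np.array([[5, 0, 3],
                   [3, 7, 9]])  # same values as df in the example above
names = "ABC"                   # hypothetical letters for indices 0, 1, 2

result = top_k_peptides(scores, names, k=3)
for total, pep in result:
    print(pep, total)
```

The first pair is AC with a total of 14, which matches the best peptide [0 2] in the output above. The work per position is about k times the number of columns, so the same function should handle the 13 x 21 table without materializing every combination.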

Upvotes: 1
