AST

Reputation: 137

Scoring each row based on a matrix in Python

I have a matrix as follows

  0   1   2   3   ...
A 0.1 0.2 0.3 0.1
C 0.5 0.4 0.2 0.1
G 0.6 0.4 0.8 0.3
T 0.1 0.1 0.4 0.2

The data is in a dataframe as shown

Genes   string
Gene1   ATGC
Gene2   GCTA
Gene3   ATCG

I need to write code to find the score of each sequence. The score for the sequence ATGC is 0.1+0.1+0.8+0.1 = 1.1: A contributes 0.1 because it is in the first position and the value for A at that position is 0.1, T contributes 0.1 for the second position, and so on. The score is calculated in the same way along the full length of each sequence (450 letters).

The output should be as follows:

Genes  Score
Gene1  1.1
Gene2  1.5
Gene3  0.7

I tried using Biopython but could not get it right. Can anyone please help?

Upvotes: 0

Views: 443

Answers (1)

DYZ

Reputation: 57085

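Let df be the score matrix (letters as the index, positions as the columns) and genes be the DataFrame of sequences. First, let's convert df into a "tall" form, with one row per (letter, position) pair: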

tall = df.stack().reset_index()
tall.columns = 'letter', 'pos', 'score'
tall.pos = tall.pos.astype(int) # Need a number here, not a string!
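To make the reshaping step concrete, here is a minimal sketch that builds the matrix from the question by hand and stacks it. The way df is constructed is an assumption (the question does not show how it was loaded); the column labels are deliberately strings, as they would typically be after reading a file, which is why the conversion to int is needed.

```python
import pandas as pd

# Assumption: the matrix from the question, built by hand, with string
# column labels "0".."3" as they would typically arrive from a file.
df = pd.DataFrame(
    [[0.1, 0.2, 0.3, 0.1],
     [0.5, 0.4, 0.2, 0.1],
     [0.6, 0.4, 0.8, 0.3],
     [0.1, 0.1, 0.4, 0.2]],
    index=list("ACGT"),
    columns=["0", "1", "2", "3"],
)

# One row per (letter, position) pair instead of a 4 x 4 grid.
tall = df.stack().reset_index()
tall.columns = ["letter", "pos", "score"]
tall["pos"] = tall["pos"].astype(int)  # positions become integers

print(tall.head())
```

With four letters and four positions, tall has 16 rows, starting with the entries for A at positions 0 to 3.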

Create a new tuple-based index for the tall DF:

tall.set_index(tall[['pos', 'letter']].apply(tuple, axis=1), inplace=True)
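As a possible variation on the tuple-valued index, the stacked scores can also be kept as a Series with a two-level MultiIndex of (pos, letter), which pandas can look up with a list of tuples. This is only a sketch under the assumption that the matrix is built by hand from the question's data with integer column labels; the names scores, keys and picked are illustrative.

```python
import pandas as pd

# Assumption: the matrix from the question, with integer column labels.
df = pd.DataFrame(
    [[0.1, 0.2, 0.3, 0.1],
     [0.5, 0.4, 0.2, 0.1],
     [0.6, 0.4, 0.8, 0.3],
     [0.1, 0.1, 0.4, 0.2]],
    index=list("ACGT"),
    columns=[0, 1, 2, 3],
)

# stack() yields a Series indexed by (letter, pos); swap the levels so
# that the keys match what enumerate() produces: (pos, letter).
scores = df.stack()
scores.index.names = ["letter", "pos"]
scores = scores.swaplevel().sort_index()

# Look up one entry per position of the sequence and add them up.
keys = list(enumerate("ATGC"))  # [(0, 'A'), (1, 'T'), (2, 'G'), (3, 'C')]
picked = scores.loc[keys]
total = picked.sum()
print(total)
```

For ATGC this selects the four values 0.1, 0.1, 0.8 and 0.1 from the question's worked example.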

This function extracts the scores indexed by tuples of the form (position, "letter") from the tall DF and sums them up:

def gene2score(gene):
  return tall.loc[list(enumerate(gene))]['score'].sum()

genes['string'].apply(gene2score)
#Genes
#Gene1    1.1
#Gene2    1.5
#Gene3    0.7
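For longer sequences (the question mentions 450 letters), a different technique that may be worth trying is NumPy integer-array indexing on the original wide matrix, which skips the reshaping step entirely. The sketch below is not the method above; it assumes the matrix columns are ordered by position starting at 0, and the data and the name gene2score_np are illustrative.

```python
import numpy as np
import pandas as pd

# Assumption: data from the question, built by hand.
df = pd.DataFrame(
    [[0.1, 0.2, 0.3, 0.1],
     [0.5, 0.4, 0.2, 0.1],
     [0.6, 0.4, 0.8, 0.3],
     [0.1, 0.1, 0.4, 0.2]],
    index=list("ACGT"),
    columns=[0, 1, 2, 3],
)
genes = pd.DataFrame(
    {"string": ["ATGC", "GCTA", "ATCG"]},
    index=pd.Index(["Gene1", "Gene2", "Gene3"], name="Genes"),
)

values = df.to_numpy()  # rows follow df.index, columns follow position


def gene2score_np(gene):
    # Row number of each letter in the matrix; -1 marks an unknown letter.
    rows = df.index.get_indexer(list(gene))
    if (rows < 0).any():
        raise ValueError(f"unknown letter in {gene!r}")
    # Column i is assumed to hold the scores for position i.
    cols = np.arange(len(gene))
    return values[rows, cols].sum()


scores = genes["string"].apply(gene2score_np)
print(scores)
```

On the question's data this gives approximately 1.1, 1.5 and 0.7 for Gene1 to Gene3. A sequence longer than the number of matrix columns raises an IndexError, which is probably the desired behaviour.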

Upvotes: 3
