Rajdeep Jaswal

Reputation: 11

Finding protein motifs and their positions in Python

I am new to Python. I want to identify a motif sequence in a large protein data set. Using the one-liner below, I was able to select the proteins I am interested in. However, I also want the start and end positions of the motif in each of these proteins. Could someone suggest what I need to add to the code below? Thank you in advance.

import re
df.loc[df['Protein_sequence'].str.contains("WA[T]R", regex=True)]

       Protein_name                                   Protein_sequence
242  >PST130_487694  MLRFFRLAALVLLMTSWEVAGDTYDPKTKTTYFGCHKNVDAVCSEP...
358  >Pucstr1_10722  MLRFFRSIALVWLMASWEVSTAGKYPNNPDPVNGAKYFGCHKNVDA...
475   >Pucstr1_2774  MLRFLILTALVLLVASWQVTDTLSQDPGDILFWCHKNVDAVCSETI...

Upvotes: 1

Views: 199

Answers (1)

mozway

Reputation: 262214

One option is to run re.search on each sequence and take the span of the match:

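import re

import pandas as pd

# 'T?' makes the T optional, so this matches both WAR and WATR
pat = re.compile('WAT?R')

# span of the first match per sequence (0-based start, exclusive end), else NA
spans = [m.span(0) if (m := pat.search(x)) else (pd.NA,) * 2
         for x in df['Protein_sequence']]

# attach the positions and keep only the rows where the motif was found
out = df.join(pd.DataFrame(spans, index=df.index, columns=['start', 'end'])
              ).dropna(subset=['start', 'end'], how='all')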

Output (on a modified input):

       Protein_name                                   Protein_sequence  start  end
242  >PST130_487694  MLRFFRLAALVLLMTSWARAGDTYDPKTKTTYFGCHKNVDAVCSEP...     16   19
358  >Pucstr1_10722  MLRFFRSIALVWATRSWEVSTAGKYPNNPDPVNGAKYFGCHKNVDA...     11   15
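
Note that search only reports the first match in each sequence. If the motif can occur more than once per protein, a variant with finditer might look like the sketch below. This is only an illustration under assumptions: the toy DataFrame and its sequences are made up, and the column names idx and match are hypothetical choices, not part of the original data.

```python
import re

import pandas as pd

# Hypothetical toy data; these sequences are invented for illustration.
df = pd.DataFrame({
    "Protein_name": [">seq_a", ">seq_b", ">seq_c"],
    "Protein_sequence": ["MKWATRLLWARQ", "MLRFFRSIALV", "WATRWATR"],
})

pat = re.compile(r"WAT?R")

# One row per non-overlapping match: original index, name, start, end, matched text.
rows = [
    (idx, name, m.start(), m.end(), m.group())
    for idx, name, seq in zip(df.index, df["Protein_name"], df["Protein_sequence"])
    for m in pat.finditer(seq)
]

# Sequences without any match simply contribute no rows.
all_hits = pd.DataFrame(
    rows, columns=["idx", "Protein_name", "start", "end", "match"]
).set_index("idx")

print(all_hits)
```

As with span, start is 0-based and end is exclusive, so seq[start:end] returns the matched motif. For 1-based residue numbering, add 1 to start and use end unchanged.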

Upvotes: 0
