Reputation: 11
I am new to Python. I want to find a motif sequence in a large protein data set. Using the one-line code below, I was able to identify the proteins I am interested in. However, I also want the start and end positions of the motif in these proteins. It would be helpful if someone could suggest what additional arguments I need to use with the code below. Thank you in advance.
import re
df.loc[df['Protein_sequence'].str.contains("WA[T]R", regex=True)]
Protein_name Protein_sequence
242 >PST130_487694 MLRFFRLAALVLLMTSWEVAGDTYDPKTKTTYFGCHKNVDAVCSEP...
358 >Pucstr1_10722 MLRFFRSIALVWLMASWEVSTAGKYPNNPDPVNGAKYFGCHKNVDA...
475 >Pucstr1_2774 MLRFLILTALVLLVASWQVTDTLSQDPGDILFWCHKNVDAVCSETI...
Upvotes: 1
Views: 199
Reputation: 262214
One option using re.search:
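import re
import pandas as pd

# Matches "WAR" or "WATR" (the T is optional)
pat = re.compile('WAT?R')

# One (start, end) span per row; rows without a match get NA and are dropped
out = df.join(pd.DataFrame([m.span(0) if (m := pat.search(x)) else (pd.NA,)*2
                            for x in df['Protein_sequence']],
                           index=df.index, columns=['start', 'end'])
             ).dropna(subset=['start', 'end'], how='all')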
Output (on a modified input):
Protein_name Protein_sequence start end
242 >PST130_487694 MLRFFRLAALVLLMTSWARAGDTYDPKTKTTYFGCHKNVDAVCSEP... 16 19
358 >Pucstr1_10722 MLRFFRSIALVWATRSWEVSTAGKYPNNPDPVNGAKYFGCHKNVDA... 11 15
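For anyone who wants to try this without the original data, here is a minimal self-contained sketch of the same approach. The toy DataFrame is an assumption: it uses shortened versions of the sequences shown above, and the helper name motif_span is hypothetical, not part of pandas or re.

```python
import re
import pandas as pd

# Toy data (assumption): truncated versions of the sequences shown above.
df = pd.DataFrame(
    {
        "Protein_name": [">PST130_487694", ">Pucstr1_10722", ">Pucstr1_2774"],
        "Protein_sequence": [
            "MLRFFRLAALVLLMTSWARAGDTY",
            "MLRFFRSIALVWATRSWEVSTAGK",
            "MLRFLILTALVLLVASWQVTDTLS",
        ],
    },
    index=[242, 358, 475],
)

pat = re.compile("WAT?R")  # "WAR" or "WATR"


def motif_span(seq):
    """Return (start, end) of the first match, or (NA, NA) if there is none."""
    m = pat.search(seq)
    return m.span() if m else (pd.NA, pd.NA)


spans = pd.DataFrame(
    [motif_span(s) for s in df["Protein_sequence"]],
    index=df.index,
    columns=["start", "end"],
)

# Keep only the proteins in which the motif was found.
out = df.join(spans).dropna(subset=["start", "end"], how="all")
print(out)
```

The positions come from Match.span(), so start is 0-based and end is exclusive: seq[start:end] returns the matched motif, and start + 1 gives the 1-based residue number. Note that 'WAT?R' also matches WAR; to match only WATR, as the question's "WA[T]R" does, the pattern would be 'WATR'. Only the first occurrence per sequence is reported, since re.search stops at the first match.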
Upvotes: 0