How can I find the start and end of a regex match using a python pandas dataframe?

Question

I get DNA or protein sequences from databases. The sequences are aligned, so although I always know one input sequence, it is often truncated and includes gaps in the form of added "-" characters. I first want to find a region in the query string. In this case, a regex search makes perfect sense. I then want to extract the equivalent regions from the other aligned strings (I've named them here "markup" and "hit"). Since the sequences are aligned, the region I want in all strings will have the same start and stop. Is there a simple way to obtain the start and stop of a regex match in a pandas dataframe?

import pandas as pd
import re
q1,q2,q3 = 'MPIMGSSVYITVELAIAVLAILG','MPIMGSSVYITVELAIAVLAILG','MPI-MGSSVYITVELAIAVLAIL'
m1,m2,m3 = '|| ||  ||||||||||||||||','||   | ||| :|| || |:: |','||:    ::|: :||||| |:: '
h1,h2,h3 = 'MPTMGFWVYITVELAIAVLAILG','MP-NSSLVYIGLELVIACLSVAG','MPLETQDALYVALELAIAALSVA' 
#create a pandas dataframe to hold the aligned sequences
df = pd.DataFrame({'query':[q1,q2,q3],'markup':[m1,m2,m3],'hit':[h1,h2,h3]})
#create a regex search string to find the desired region in the query sequence, allowing any number of gap characters ("-") after each residue
desired_region_from_query = 'PIMGSS'
regex_desired_region_from_query = '(P-*I-*M-*G-*S-*S-*)'

Pandas has a nice extract function to slice out the matched sequence from the query:

df['query'].str.extract(regex_desired_region_from_query)

However I need the start and end of the match in order to extract the equivalent regions from the markup and hit columns. For a single string, this is done as follows:

match = re.search(regex_desired_region_from_query, df.loc[2,'query'])
sliced_hit = df.loc[2,'hit'][match.start():match.end()]
sliced_hit
Out[3]:'PLETQDA'
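
For reference, here is a minimal self-contained sketch of the single-string case. It uses match.span(), which returns the (start, end) pair in one call. The sequences are copied from row 2 above; the variable names (query, hit, pattern) are only for illustration.

```python
import re

# row 2 of the example data: the query contains one gap character
query = 'MPI-MGSSVYITVELAIAVLAIL'
hit = 'MPLETQDALYVALELAIAALSVA'
pattern = '(P-*I-*M-*G-*S-*S-*)'

match = re.search(pattern, query)
# span() is equivalent to (match.start(), match.end())
start, end = match.span()

# the same bounds slice the aligned region out of both strings
sliced_query = query[start:end]
sliced_hit = hit[start:end]
print(start, end, sliced_query, sliced_hit)
```

Because the gap sits inside the matched region, the match is seven characters long rather than six, and the slice from the hit has the same length.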

My current workaround is as follows. (Edited to include nhahtdh's suggestion and therefore avoid searching twice.)

#define function to obtain regex output (start, stop, etc) as a tuple
def get_regex_output(x):
    m = re.search(regex_desired_region_from_query, x)
    return (m.start(), m.end())
#apply function
df['regex_output_tuple'] = df['query'].apply(get_regex_output)
#convert the tuple into two separate columns
columns_from_regex_output = ['start','end']      
for n, col in enumerate(columns_from_regex_output):
    df[col] = df['regex_output_tuple'].apply(lambda x: x[n])
#delete the unnecessary column
df = df.drop('regex_output_tuple', axis=1)
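
A possibly more compact variant of the same workaround is sketched below. It builds the two columns directly from the list of (start, end) tuples, so no temporary column is needed. The helper name get_span and the (None, None) fallback for rows without a match are my own assumptions, not part of the code above.

```python
import re
import pandas as pd

pattern = re.compile('(P-*I-*M-*G-*S-*S-*)')
df = pd.DataFrame({'query': ['MPIMGSSVYITVELAIAVLAILG',
                             'MPIMGSSVYITVELAIAVLAILG',
                             'MPI-MGSSVYITVELAIAVLAIL']})

def get_span(seq):
    """Return (start, end) of the first match, or (None, None) if there is no match."""
    m = pattern.search(seq)
    return m.span() if m else (None, None)

# one (start, end) tuple per row, expanded into a two-column frame
spans = pd.DataFrame(df['query'].map(get_span).tolist(),
                     columns=['start', 'end'], index=df.index)
df = df.join(spans)
print(df)
```

If some rows have no match, the start and end columns will contain missing values and will no longer be plain integers, so those rows would need to be filtered out before slicing.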

Now I want to use the start and end integers to slice the strings row by row. Syntax like this would be ideal:
df.sliced = df.string[df.start:df.end]
But as far as I know, pandas does not support slicing with per-row bounds in this way. Instead I have once again used lambda functions:

#create slice functions
fn_slice_hit = lambda x : x['hit'][x['start']:x['end']]
fn_slice_markup = lambda x : x['markup'][x['start']:x['end']]

#apply the slice functions
df['sliced_markup'] = df.apply(fn_slice_markup, axis = 1)
df['sliced_hit'] = df.apply(fn_slice_hit, axis = 1)
print(df)

                       hit                   markup                    query   start  end sliced_markup sliced_hit
0  MPTMGFWVYITVELAIAVLAILG  || ||  ||||||||||||||||  MPIMGSSVYITVELAIAVLAILG       1    7        | ||       PTMGFW
1  MP-NSSLVYIGLELVIACLSVAG  ||   | ||| :|| || |:: |  MPIMGSSVYITVELAIAVLAILG       1    7        |   |      P-NSSL
2  MPLETQDALYVALELAIAALSVA  ||:    ::|: :||||| |::   MPI-MGSSVYITVELAIAVLAIL       1    8       |:    :    PLETQDA
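
One alternative to the row-wise apply is a plain list comprehension over the zipped columns, sketched here under the assumption that the start and end columns already hold integers. The helper name slice_column is hypothetical, and the start and end values are typed in by hand from the output above.

```python
import pandas as pd

df = pd.DataFrame({
    'hit': ['MPTMGFWVYITVELAIAVLAILG',
            'MP-NSSLVYIGLELVIACLSVAG',
            'MPLETQDALYVALELAIAALSVA'],
    'markup': ['|| ||  ||||||||||||||||',
               '||   | ||| :|| || |:: |',
               '||:    ::|: :||||| |:: '],
    'start': [1, 1, 1],
    'end': [7, 7, 8],
})

def slice_column(frame, col):
    """Slice each string in frame[col] using that row's start and end."""
    return [s[a:b] for s, a, b in zip(frame[col], frame['start'], frame['end'])]

# the same helper covers every aligned column
for col in ['markup', 'hit']:
    df['sliced_' + col] = slice_column(df, col)
print(df[['sliced_markup', 'sliced_hit']])
```

This avoids defining one lambda per column, and it iterates over the columns directly rather than building a row object for every call as df.apply(..., axis=1) does.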

Do the pandas .str.match, .str.extract and .str.findall methods offer an equivalent of the .start() and .end() methods of a regex match object?
Is there a way to slice more elegantly?
Any help would be appreciated!
