Mark Teese
Mark Teese

Reputation: 691

How can I find the start and end of a regex match using a python pandas dataframe?

I get DNA or protein sequences from databases. The sequences are aligned, so although I always know one input sequence, it is often truncated and includes gaps in the form of added "-" characters. I first want to find a region in the query string. In this case, a regex search makes perfect sense. I then want to extract the equivalent regions from the other aligned strings (I've named them here "markup" and "hit"). Since the sequences are aligned, the region I want in all strings will have the same start and stop. Is there a simple way to obtain the start and stop of a regex match in a pandas dataframe?

import pandas as pd
import re
q1,q2,q3 = 'MPIMGSSVYITVELAIAVLAILG','MPIMGSSVYITVELAIAVLAILG','MPI-MGSSVYITVELAIAVLAIL'
m1,m2,m3 = '|| ||  ||||||||||||||||','||   | ||| :|| || |:: |','||:    ::|: :||||| |:: '
h1,h2,h3 = 'MPTMGFWVYITVELAIAVLAILG','MP-NSSLVYIGLELVIACLSVAG','MPLETQDALYVALELAIAALSVA' 
#create a pandas dataframe to hold the aligned sequences
df = pd.DataFrame({'query':[q1,q2,q3],'markup':[m1,m2,m3],'hit':[h1,h2,h3]})
#create a regex search string to find the appropriate subset in the query sequence, 
desired_region_from_query = 'PIMGSS'
regex_desired_region_from_query = '(P-*I-*M-*G-*S-*S-*)'

Pandas has a nice extract function to slice out the matched sequence from the query:

df['query'].str.extract(regex_desired_region_from_query)

However I need the start and end of the match in order to extract the equivalent regions from the markup and hit columns. For a single string, this is done as follows:

match = re.search(regex_desired_region_from_query, df.loc[2,'query'])
sliced_hit = df.loc[2,'hit'][match.start():match.end()]
sliced_hit
Out[3]:'PLETQDA'

My current workaround is as follows. (Edited to include nhahtdh's suggestion and therefore avoid searching twice.)

#define function to obtain regex output (start, stop, etc) as a tuple
def get_regex_output(x):
    m = re.search(regex_desired_region_from_query, x)
    return (m.start(), m.end())
#apply function
df['regex_output_tuple'] = df['query'].apply(get_regex_output)
#convert the tuple into two separate columns
columns_from_regex_output = ['start','end']      
for n, col in enumerate(columns_from_regex_output):
    df[col] = df['regex_output_tuple'].apply(lambda x: x[n])
#delete the unnecessary column
df = df.drop('regex_output_tuple', axis=1)

Now I want to use the obtained start and end integers to slice the strings. This code would be nice:
df.sliced = df.string[df.start:df.end]
But I don't think it currently exists. Instead I have once again used lambda functions:

#create slice functions
fn_slice_hit = lambda x : x['hit'][x['start']:x['end']]
fn_slice_markup = lambda x : x['markup'][x['start']:x['end']]

#apply the slice functions
df['sliced_markup'] = df.apply(fn_slice_markup, axis = 1)
df['sliced_hit'] = df.apply(fn_slice_hit, axis = 1)
print(df)

                       hit                   markup                    query   start  end sliced_markup sliced_hit
0  MPTMGFWVYITVELAIAVLAILG  || ||  ||||||||||||||||  MPIMGSSVYITVELAIAVLAILG       1    7        | ||       PTMGFW
1  MP-NSSLVYIGLELVIACLSVAG  ||   | ||| :|| || |:: |  MPIMGSSVYITVELAIAVLAILG       1    7        |   |      P-NSSL
2  MPLETQDALYVALELAIAALSVA  ||:    ::|: :||||| |::   MPI-MGSSVYITVELAIAVLAIL       1    8       |:    :    PLETQDA

Do pandas .match, .extract, .findall functions have the equivalent of a .start() or .end() attribute?
Is there a way to slice more elegantly?
Any help would be appreciated!

Upvotes: 3

Views: 1866

Answers (1)

jkitchen
jkitchen

Reputation: 1060

I don't think this exists in pandas, but would be a great addition. Go to https://github.com/pydata/pandas/issues and add a new Issue. Explain that it's an enhancement that you'd like to see.

For the .start() and .end() method, those probably make more sense as kwargs to the extract() method. If str.extract(pat, start_index=True), then returns a Series or Dataframe of start indexes rather than the value of the capture group. Same goes for end_index=True. Those probably need to be mutually exclusive.

I also like your suggestion of

df.sliced = df.string[df.start:df.end]

Pandas already has a str.slice method

df.sliced = df.string.str.slice(1, -1)

But those have to be ints. Add a separate issue on Github to have the str.slice method take series objects and apply element-wise.

Sorry to not have a better solution than your lambda hack, but it's use-cases like these that help drive Pandas to be better.

Upvotes: 1

Related Questions