extract substring between multiple words in a pandas dataframe

Question

I have a pandas data frame where I need to extract sub-string from each row of a column based on the following conditions

We have start_list ('one','once I','he') and end_list ('fine','one','well').
The sub-string should be preceded by any of the elements of the start_list.
The sub-string may be succeeded by any of the elements of the end_list.
When any of the elements of the start_list is available then the succeeding sub string should be extracted with/without the presence of the elements of the end_list.

Example Problem:

df = pd.DataFrame({'a' : ['one was fine today', 'we had to drive', ' ','I 
                     think once I was fine eating ham ', 'he studies really 
                     well 
                     and is polite ', 'one had to live well and prosper', 
                     '43948785943one by onej89044809', '827364hjdfvbfv', 
                     '&^%$&*+++===========one kfnv dkfjn uuoiu fine', 'they 
                     is one who makes me crazy'], 
                  'b' : ['11', '22', '33', '44', '55', '66', '77', '', '88', 
                     '99']})

Expected Result:

df = pd.DataFrame({'a' : ['was', '','','was ','studies really','had to live',
                       'by','','kfnv dkfjn uuoiu','who makes me crazy'],
                   'b' : ['11', '22', '33', '44', '55', '66', '77', '', 
                       '88','99']})

abe · Accepted Answer

I think this should work for you. This solution requires Pandas of course and also the built-in library functools.

Function: remove_preceders

This function takes as input a collection of words start_list and str string. It looks to see if any of the items in start_list are in string, and if so returns only the piece of string that occurs after said items. Otherwise, it returns the original string.

def remove_preceders(start_list, string):
    for word in start_list:
        if word in string:
            string = string[string.find(word) + len(word):]
    return string

Function: remove_succeders

This function is very similar to the first, except it returns only the piece of string that occurs before the items in end_list.

def remove_succeeders(end_list, string):
    for word in end_list:
        if word in string:
            string = string[:string.find(word)]
    return string

Function: to_apply

How do you actually run the above functions? The apply method allows you to run complex functions on a DataFrame or Series, but it will then look for as input either a full row or single value, respectively (based on whether you're running on a DF or S).

This function takes as input a function to run & a collection of words to check, and we can use it to run the above two functions:

def to_apply(func, words_to_check):
    return functools.partial(func, words_to_check)

How to Run

df['no_preceders'] = df.a.apply(
                         to_apply(remove_preceders, 
                                 ('one', 'once I', 'he'))
                               )
df['no_succeders'] = df.a.apply(
                          to_apply(remove_succeeders, 
                                  ('fine', 'one', 'well'))
                               )
df['substring'] = df.no_preceders.apply(
                          to_apply(remove_succeeders, 
                                  ('fine', 'one', 'well'))
                               )

And then there's one final step to remove the items from the substring column that were not affected by the filtering:

def final_cleanup(row):
    if len(row['a']) == len(row['substring']):
        return ''
    else:
        return row['substring']

df['substring'] = df.apply(final_cleanup, axis=1)

Results

Hope this works.

extract substring between multiple words in a pandas dataframe

Answers (1)

Related Questions