python parse string from multi valued column

Question

I'm reading an excel file into a dataframe where one of its columns has 1 to many values, delimited by a space. I need to search the column for a string of text, and if found return the complete value between its delimiters(not the entire cell value).

Input would look something like this –

import pandas as pd

df = pd.DataFrame({'command': ['abc123', 'abcdef', 'hold', 'why'],
                   'name': ['/l1/l2/good/a/b/c /l1/l2/bad/b /l1/l2/fred/x /la/lb/sure/blah',
                            '/l1/l2/fred/a/b/c /l1/l2/bad/b /l1/l2/fred/x',
                            '/l2/l3/betty/a/b/c /l1/l2/bad/b /l1/l2/fred/x /l1/l2/good/ha',
                            '/la/lb2/sure/a/b/c'],
                   'date': ['2020-05', '2020-05', '2020-05', '2020-06']})

With a print it returns -

  command     date                                               name
0  abc123  2020-05  /l1/l2/good/a/b/c /l1/l2/bad/b /l1/l2/fred/x /la/lb/sure/blah
1  abcdef  2020-05       /l1/l2/fred/a/b/c /l1/l2/bad/b /l1/l2/fred/x
2    hold  2020-05  /l2/l3/betty/a/b/c /l1/l2/bad/b /l1/l2/fred/x /l1/l2/good/ha
3     why  2020-06                                 /la/lb2/sure/a/b/c

I'm using the following to parse for the strings I'm after –

terms = ['/l1/l2/good', '/la/lb/sure']
df = df[df['name'].str.contains('|'.join(terms))]

which returns –

  command     date                                               name
0  abc123  2020-05  /l1/l2/good/a/b/c /l1/l2/bad/b /l1/l2/fred/x /la/lb/sure/blah
2    hold  2020-05  /l2/l3/betty/a/b/c /l1/l2/bad/b /l1/l2/fred/x /l1/l2/good/ha 
3     why  2020-06                                  /la/lb/sure/a/b/c

What I would like returned is –

  command     date                                               name
0  abc123  2020-05                  /l1/l2/good/a/b/c  /la/lb/sure/blah
2    hold  2020-05                                       /l1/l2/good/ha 
3     why  2020-06                                    /la/lb/sure/a/b/c

I had tried performing a split on the space delimiter, but then I'm unable to loop through those values to parse them as needed.

Thanks.

aneroid · Accepted Answer

The problem with using replace() is that it will only apply to what matches, and not to the items you want to skip/remove from the results. Instead, use findall():

pat = f"((?:{'|'.join(terms)})[^ ]*)"
# `pat` is ((?:/l1/l2/good|/la/lb/sure)[^ ]*)

That regex attempts to match with each pattern and everything that follows it except a space.

df['name'].str.findall(pat)  # will give all the matches, as a series:

0    [/l1/l2/good/a/b/c, /la/lb/sure/blah]
1                                       []
2                         [/l1/l2/good/ha]
3                      [/la/lb/sure/a/b/c]
Name: name, dtype: object

join the results with a space and assign it back to 'name':

df['name'] = df['name'].str.findall(pat).str.join(' ')

Result df:

  command                                name     date
0  abc123  /l1/l2/good/a/b/c /la/lb/sure/blah  2020-05
1  abcdef                                      2020-05
2    hold                      /l1/l2/good/ha  2020-05
3     why                   /la/lb/sure/a/b/c  2020-06

(Btw, in your DataFrame definition, your last entry is '/la/lb2/sure/a/b/c' instead of '/la/lb/sure/a/b/c'. I've changed that. And date is the last column in your def.)

python parse string from multi valued column

Answers (2)

Related Questions