Python pandas `replace` is not acting consistently

Question

I have a substantial database where I'm removing leading text of various lengths. Here's a minimal working example:

import pandas as pd

data = {'Title': ['Bertram, C. et al., 2015a: Carbon',
                  'Bertram, C. et al., 2015b: Complementing',
                  'Bertram, C. et al., 2018: Targeted']}
df = pd.DataFrame(data, columns=['Title'])

which gives

                                      Title
0         Bertram, C. et al., 2015a: Carbon
1  Bertram, C. et al., 2015b: Complementing
2        Bertram, C. et al., 2018: Targeted

First Attempt

I pass a regular expression to pandas' `replace` method:

df['Title'].replace(r'(\A[\D\s.,]*\d\d\d\d[ab:] )', '', regex=True, inplace=True)

But that doesn't address all cases:

                                      Title
0         Bertram, C. et al., 2015a: Carbon
1  Bertram, C. et al., 2015b: Complementing
2                                  Targeted

Second Attempt

I pass a list of patterns to the `regex` argument of `replace`:

df['Title'].replace(regex=[r'(\A[\D\s.,]*\d\d\d\d:)', 
                           r'(\A[\D\s.,]*\d\d\d\da:)'
                           r'(\A[\D\s.,]*\d\d\d\db:)'], value='', inplace=True)

But that gives the same results.

                                      Title
0         Bertram, C. et al., 2015a: Carbon
1  Bertram, C. et al., 2015b: Complementing
2                                  Targeted

Third Attempt

If I reorder the regex list:

df['Title'].replace(regex=[r'(\A[\D\s.,]*\d\d\d\da:)', 
                           r'(\A[\D\s.,]*\d\d\d\db:)'
                           r'(\A[\D\s.,]*\d\d\d\d:)'], value='', inplace=True)

I get a little improvement, but not enough:

                                      Title
0                                    Carbon
1  Bertram, C. et al., 2015b: Complementing
2                                  Targeted

Desired Result

           Title
0         Carbon
1  Complementing
2       Targeted

Lack of Related Questions

I've read the documentation for both `re` and pandas' `replace` closely, but I'm still missing something. None of the existing SO Q&As come close to this problem.

DYZ · Accepted Answer

"[ab:]" means "exactly one character: either a, or b, or :". You need "[ab:]+" ("either a, or b, or :, possibly repeated"), because two of those characters appear in a row in, e.g., "2015a:". With this correction, the first method will work.
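
As a minimal sketch of that correction, the snippet below rebuilds the example frame from the question and applies the first attempt with "[ab:]+" in place of "[ab:]". Assigning the result back to the column instead of using inplace=True is my own choice here, not something the answer prescribes. The second half is an observation, not part of the accepted answer: the pattern lists in the second and third attempts have no comma between their last two strings, so Python joins those two literals into a single pattern. The variable names `pattern` and `patterns` are only illustrative.

```python
import pandas as pd

data = {'Title': ['Bertram, C. et al., 2015a: Carbon',
                  'Bertram, C. et al., 2015b: Complementing',
                  'Bertram, C. et al., 2018: Targeted']}
df = pd.DataFrame(data, columns=['Title'])

# "[ab:]+" matches one or more of "a", "b", ":", so it covers "a:", "b:" and ":".
pattern = r'\A[\D\s.,]*\d\d\d\d[ab:]+ '

# Assign the result back rather than relying on inplace=True on a column.
df['Title'] = df['Title'].replace(pattern, '', regex=True)
print(df['Title'].tolist())

# Adjacent string literals without a comma are concatenated by Python,
# so this list (laid out as in the second attempt) holds two patterns, not three.
patterns = [r'(\A[\D\s.,]*\d\d\d\d:)',
            r'(\A[\D\s.,]*\d\d\d\da:)'
            r'(\A[\D\s.,]*\d\d\d\db:)']
print(len(patterns))
```

If the list form is preferred, adding the missing comma (and a trailing space inside each pattern, so the leading blank is removed as well) should make it behave like the single-pattern version.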
