David Collins
David Collins

Reputation: 900

Python pandas `replace` is not acting consistent

I have a substantial database where I'm removing leading text of various lengths. Here's a minimal working example:

data = {'Title' : ['Bertram, C. et al., 2015a: Carbon', 
                   'Bertram, C. et al., 2015b: Complementing', 
                   'Bertram, C. et al., 2018: Targeted']}
df = pd.DataFrame(data, columns = ['Title'])

which gives

                                      Title
0         Bertram, C. et al., 2015a: Carbon
1  Bertram, C. et al., 2015b: Complementing
2        Bertram, C. et al., 2018: Targeted

First attempt

I apply re within pandas replace method:

df['Title'].replace(r'(\A[\D\s.,]*\d\d\d\d[ab:] )', '', regex=True, inplace=True)

But that doesn't address all cases:

                                      Title
0         Bertram, C. et al., 2015a: Carbon
1  Bertram, C. et al., 2015b: Complementing
2                                  Targeted

Second Attempt

I use the regex command within replace:

df['Title'].replace(regex=[r'(\A[\D\s.,]*\d\d\d\d:)', 
                           r'(\A[\D\s.,]*\d\d\d\da:)'
                           r'(\A[\D\s.,]*\d\d\d\db:)'], value='', inplace=True)

But that gives the same results.

                                      Title
0         Bertram, C. et al., 2015a: Carbon
1  Bertram, C. et al., 2015b: Complementing
2                                  Targeted

Third Attempt

If I reorder the regex list:

df['Title'].replace(regex=[r'(\A[\D\s.,]*\d\d\d\da:)', 
                           r'(\A[\D\s.,]*\d\d\d\db:)'
                           r'(\A[\D\s.,]*\d\d\d\d:)'], value='', inplace=True)

I get a little improvement, but not enough:

                                      Title
0                                    Carbon
1  Bertram, C. et al., 2015b: Complementing
2                                  Targeted

Desired Result

    Title
0   Carbon
1   Complementing
2   Targeted

Lack of Related Questions

I've closely looked over the documentation for both re and panda's replace, but something is amiss. None of the SO Q&A come close to this problem.

Upvotes: 0

Views: 57

Answers (2)

BhishanPoudel
BhishanPoudel

Reputation: 17164

If you always have colon : in the end and want to have the last words after it, you may not want to use re module. Usually re is much slower than simple string operations.

Alternative could be:

data = {'Title' : ['Bertram, C. et al., 2015a: Carbon', 
                   'Bertram, C. et al., 2015b: Complementing', 
                   'Bertram, C. et al., 2018: Targeted']}
df = pd.DataFrame(data, columns = ['Title'])
df['title2'] = df.Title.str.split(':').str[-1].str.lstrip()

print(df)

output

0         Bertram, C. et al., 2015a: Carbon         Carbon
1  Bertram, C. et al., 2015b: Complementing  Complementing
2        Bertram, C. et al., 2018: Targeted       Targeted

Upvotes: 0

DYZ
DYZ

Reputation: 57085

"[ab:]" means "either a, or b, or :". You need "[ab:]+" ("either a, or b, or :, possibly repeated"), because they are repeated in, e.g., "2015a:". With this correction, the first method will work.

Upvotes: 1

Related Questions