Reputation: 900
I have a substantial database where I'm removing leading text of various lengths. Here's a minimal working example:
data = {'Title' : ['Bertram, C. et al., 2015a: Carbon',
'Bertram, C. et al., 2015b: Complementing',
'Bertram, C. et al., 2018: Targeted']}
df = pd.DataFrame(data, columns = ['Title'])
which gives
Title
0 Bertram, C. et al., 2015a: Carbon
1 Bertram, C. et al., 2015b: Complementing
2 Bertram, C. et al., 2018: Targeted
First attempt
I apply re
within pandas replace
method:
df['Title'].replace(r'(\A[\D\s.,]*\d\d\d\d[ab:] )', '', regex=True, inplace=True)
But that doesn't address all cases:
Title
0 Bertram, C. et al., 2015a: Carbon
1 Bertram, C. et al., 2015b: Complementing
2 Targeted
Second Attempt
I use the regex
command within replace
:
df['Title'].replace(regex=[r'(\A[\D\s.,]*\d\d\d\d:)',
r'(\A[\D\s.,]*\d\d\d\da:)'
r'(\A[\D\s.,]*\d\d\d\db:)'], value='', inplace=True)
But that gives the same results.
Title
0 Bertram, C. et al., 2015a: Carbon
1 Bertram, C. et al., 2015b: Complementing
2 Targeted
Third Attempt
If I reorder the regex list:
df['Title'].replace(regex=[r'(\A[\D\s.,]*\d\d\d\da:)',
r'(\A[\D\s.,]*\d\d\d\db:)'
r'(\A[\D\s.,]*\d\d\d\d:)'], value='', inplace=True)
I get a little improvement, but not enough:
Title
0 Carbon
1 Bertram, C. et al., 2015b: Complementing
2 Targeted
Desired Result
Title
0 Carbon
1 Complementing
2 Targeted
Lack of Related Questions
I've closely looked over the documentation for both re
and panda
's replace
, but something is amiss. None of the SO Q&A come close to this problem.
Upvotes: 0
Views: 57
Reputation: 17164
If you always have colon :
in the end and want to have the last words after it, you may not want to use re
module. Usually re is much slower than simple string operations.
Alternative could be:
data = {'Title' : ['Bertram, C. et al., 2015a: Carbon',
'Bertram, C. et al., 2015b: Complementing',
'Bertram, C. et al., 2018: Targeted']}
df = pd.DataFrame(data, columns = ['Title'])
df['title2'] = df.Title.str.split(':').str[-1].str.lstrip()
print(df)
output
0 Bertram, C. et al., 2015a: Carbon Carbon
1 Bertram, C. et al., 2015b: Complementing Complementing
2 Bertram, C. et al., 2018: Targeted Targeted
Upvotes: 0
Reputation: 57085
"[ab:]"
means "either a, or b, or :". You need "[ab:]+"
("either a, or b, or :, possibly repeated"), because they are repeated in, e.g., "2015a:"
. With this correction, the first method will work.
Upvotes: 1