Reputation: 6132
I'm working with a DataFrame that has a year column in the following format:
year
2015
2015-2016
2016
I want to replace strings like '2015-2016' with just '2015' using regex. I tried something like this:
df['year']=df['year'].str.replace('[0-9]{4}\-[0-9]{4}','[0-9]{4}')
But that doesn't work. I know I could do something like:
df['year']=df['year'].str.replace(r'\-[0-9]{4}','',regex=True)
But sometimes you need something more flexible. Is there any way to keep a portion of the match in the regex or is this one the standard approach?
Thanks in advance.
Upvotes: 0
Views: 55
Reputation: 45552
You can capture the good year in parentheses and refer to it in your replacement with \1:
df['year'].str.replace(r'([0-9]{4})\-[0-9]{4}', r'\1', regex=True)
Or you can turn the parentheses around the good year into a non-capturing positive lookbehind assertion with ?<=. The replacement string is then empty, because only \-[0-9]{4} is matched (and only when it is preceded by [0-9]{4}).
df['year'].str.replace(r'(?<=[0-9]{4})\-[0-9]{4}', '', regex=True)
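To see that the two patterns do the same job, here is a minimal sketch using Python's standard re module directly on the sample values from the question. The variable names are made up for illustration. With regex=True, pandas' str.replace follows the same regex syntax, so the patterns should carry over unchanged.

```python
import re

# Sample values taken from the question.
years = ["2015", "2015-2016", "2016"]

# Capture group: match the whole range, then write back only group 1.
capture = re.compile(r"([0-9]{4})-[0-9]{4}")
kept_by_group = [capture.sub(r"\1", y) for y in years]

# Lookbehind: match only the "-YYYY" tail, and only when four digits
# come right before it, then replace that tail with nothing.
lookbehind = re.compile(r"(?<=[0-9]{4})-[0-9]{4}")
kept_by_lookbehind = [lookbehind.sub("", y) for y in years]

print(kept_by_group)       # ['2015', '2015', '2016']
print(kept_by_lookbehind)  # ['2015', '2015', '2016']
```

The capture-group form is usually the more flexible one, since you can keep several parts of the match (\1, \2, ...) and reorder them in the replacement.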
Upvotes: 2
Reputation: 51395
If you just want to keep the first year, and all years have 4 digits, use:
df['year'] = df.year.str.extract(r'(\d{4})', expand=False)
>>> df
year
0 2015
1 2015
2 2016
If you want to keep the first year before any -, use:
df['year'] = df.year.str.split('-').str[0]
>>> df
year
0 2015
1 2015
2 2016
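The two approaches agree on the sample data but can differ on messier input. Below is a small sketch of that difference, assuming pandas is installed. The extra "n/a" row is hypothetical and not part of the question's data.

```python
import pandas as pd

# The first three values come from the question; "n/a" is a made-up
# extra row to show how the two approaches treat non-year text.
s = pd.Series(["2015", "2015-2016", "2016", "n/a"])

# extract: returns the first run of four digits, or NaN if there is none.
by_extract = s.str.extract(r"(\d{4})", expand=False)

# split: returns whatever precedes the first hyphen, digits or not.
by_split = s.str.split("-").str[0]

print(by_extract.tolist())
print(by_split.tolist())
```

So extract is probably the safer choice when you want to flag rows that contain no year at all, while split leaves such rows untouched.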
Upvotes: 2