Using regex in second series.str.replace() argument

Question

I'm working with a DataFrame with a year column with the following format:

I want to replace strings like '2015-2016' with just '2015' using regex. I tried something like this:

df['year']=df['year'].str.replace('[0-9]{4}\-[0-9]{4}','[0-9]{4}')

But that doesn't work. I know I could do something like:

df['year']=df['year'].str.replace(r'-[0-9]{4}', '', regex=True)

But sometimes you need something more flexible. Is there a way to keep a portion of the match in the replacement, or is this the standard approach?

Thanks in advance.

sacuL · Accepted Answer

If you just want to keep the first year, and all years have 4 digits, use:

df['year'] = df.year.str.extract(r'(\d{4})', expand=False)

>>> df
   year
0  2015
1  2015
2  2016

If you want to keep the first year before any -, use:

df['year'] = df.year.str.split('-').str[0]

>>> df
   year
0  2015
1  2015
2  2016
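
For the more general case raised in the question (keeping part of the match), one option is a capture group in the pattern and a backreference in the replacement. This is not part of the answer above. It is a minimal sketch, and the sample values are an assumption because the original data is not shown:

```python
import pandas as pd

# Hypothetical sample data; the question's actual table is not available.
df = pd.DataFrame({'year': ['2015-2016', '2015', '2016-2017']})

# Group 1 captures the first year; r'\1' in the replacement refers back to it.
# regex=True is passed explicitly so the pattern is not treated as a literal string.
df['year'] = df['year'].str.replace(r'(\d{4})-\d{4}', r'\1', regex=True)

print(df['year'].tolist())
```

Rows that do not match the pattern are left unchanged. If the replacement needs more logic than a backreference, str.replace also accepts a callable that receives the match object, for example lambda m: m.group(1).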
