Juan C

Reputation: 6132

Using regex in second series.str.replace() argument

I'm working with a DataFrame with a year column with the following format:

  year
  2015
2015-2016
  2016

I want to replace strings like '2015-2016' with just '2015' using regex. I tried something like this:

df['year']=df['year'].str.replace('[0-9]{4}\-[0-9]{4}','[0-9]{4}')

But that doesn't work. I know I could do something like:

df['year'] = df['year'].str.replace(r'-[0-9]{4}', '', regex=True)

But sometimes you need something more flexible. Is there any way to keep a portion of the match in the regex or is this one the standard approach?

Thanks in advance.

Upvotes: 0

Views: 55

Answers (2)

Steven Rumbalski

Reputation: 45552

You can capture the year you want to keep in parentheses and refer to it in the replacement with \1:

df['year'].str.replace(r'([0-9]{4})-[0-9]{4}', r'\1', regex=True)

Alternatively, turn the parentheses around the first year into a positive lookbehind assertion with (?<=...). A lookbehind is not part of the match, so only -[0-9]{4} is matched (and only when it is preceded by [0-9]{4}), and the replacement string can be empty:

df['year'].str.replace(r'(?<=[0-9]{4})-[0-9]{4}', '', regex=True)
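
To see what each pattern does in isolation, here is a minimal sketch using Python's standard re module, whose pattern syntax is the one pandas uses when the pattern is treated as a regex. The list years stands in for the column from the question, and keep_first_year is a hypothetical helper added for illustration; str.replace also accepts a callable as the replacement, which may be the "more flexible" option the question asks about.

```python
import re

# Stand-in for the values of df['year'] in the question.
years = ["2015", "2015-2016", "2016"]

# Backreference: group 1 captures the first year, and \1 writes it back.
with_backref = [re.sub(r"([0-9]{4})-[0-9]{4}", r"\1", y) for y in years]

# Lookbehind: only "-2016" is matched, and only when four digits precede it,
# so replacing the match with an empty string leaves the first year intact.
with_lookbehind = [re.sub(r"(?<=[0-9]{4})-[0-9]{4}", "", y) for y in years]


def keep_first_year(match):
    """Hypothetical helper: build the replacement from the match object."""
    return match.group(1)


# Callable replacement: the function receives each match and returns the text
# that should replace it.
with_callable = [re.sub(r"([0-9]{4})-[0-9]{4}", keep_first_year, y) for y in years]

print(with_backref)
print(with_lookbehind)
print(with_callable)
```

All three variants leave '2015' and '2016' untouched and reduce '2015-2016' to '2015'.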

Upvotes: 2

sacuL

Reputation: 51395

If you just want to keep the first year, and all years have 4 digits, use:

df['year'] = df['year'].str.extract(r'(\d{4})', expand=False)
>>> df
   year
0  2015
1  2015
2  2016

If you want to keep the first year before any -, use:

df['year'] = df.year.str.split('-').str[0]

>>> df
   year
0  2015
1  2015
2  2016
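
The two approaches agree on the sample data but can differ on other inputs. The sketch below mimics them with plain Python (re.search for the extract pattern, str.split for the split) so the difference is easy to see; the value 'FY2015-2016' and the helper names first_four_digits and before_hyphen are hypothetical and not part of the question.

```python
import re

# The first two values come from the question; the third is a made-up edge case.
samples = ["2015", "2015-2016", "FY2015-2016"]


def first_four_digits(text):
    """Mimic the extract pattern: the first run of four digits, wherever it is."""
    match = re.search(r"(\d{4})", text)
    return match.group(1) if match else None


def before_hyphen(text):
    """Mimic the split approach: everything before the first hyphen."""
    return text.split("-")[0]


by_extract = [first_four_digits(s) for s in samples]
by_split = [before_hyphen(s) for s in samples]

print(by_extract)
print(by_split)
```

With these inputs the extract-style helper returns '2015' for every row, while the split-style helper keeps the 'FY' prefix on the last one, so the choice depends on how clean the column is.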

Upvotes: 2
