Regex not extracting correct value from behind

Question

For the string "4/3/09", using

df['dates'] = df['dates'].str.replace(r'([/ ]\d\d)\b', r'19\g<0>', regex=True)
# or
df['dates'] = df['dates'].str.replace(r'([/ ]\d\d)$', r'19\g<0>', regex=True)

I am getting 4/319/09, but I should get 4/3/1909.

My data:

date_set = ['04/20/2009', '04/20/09', '4/20/09', '4/3/09',
'Mar-20-2009', 'Mar 20, 2009', 'March 20, 2009', 'Mar. 20, 2009', 
'Mar 20 2009','20 Mar 2009', '20 March 2009', '20 Mar. 2009', 
'20 March, 2009','Mar 20th, 2009', 'Mar 21st, 2009', 'Mar 22nd, 2009',
'Feb 2009', 'Sep 2009', 'Oct 2010',
'6/2008', '12/2009',
'2009', '2010']

If there is a 2-digit year, I need to add 1900 to it. For example, if the year is 09, it should be replaced with 1909.

Wiktor Stribiżew · Accepted Answer

The ([/ ]\d\d)\b pattern matches a / or space followed by two digits and a word boundary. str.replace then replaces the whole match (here, /09) with 19 plus that whole match, so 4/3 + 19/09 gives 4/319/09.
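The behaviour described above can be reproduced with plain re.sub, used here only as a stand-in for the pandas call. This is a minimal sketch; the variable names are illustrative and not from the question.

```python
import re

s = "4/3/09"
pattern = r'([/ ]\d\d)\b'

# \g<0> is the whole match, and the whole match includes the slash.
whole_match = re.search(pattern, s).group(0)

# "19" is therefore placed in front of the slash, not after it.
result = re.sub(pattern, r'19\g<0>', s)

print(whole_match)  # /09
print(result)       # 4/319/09
```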

You need to use

df['dates'] = df['dates'].str.replace(r'([/ ])(\d\d)\b', r'\g<1>19\2', regex=True)

See the regex demo

Here,

([/ ]) - Capturing group 1: a / or space
(\d\d) - Capturing group 2: two digits
\b - word boundary

The replacement is r'\g<1>19\2', i.e. the Group 1 value + 19 + the Group 2 value. \g<1> is an unambiguous backreference to Group 1; it is needed because the next char in the replacement pattern is a digit (see python re.sub group: number after number). \2 is a regular numeric backreference, which is fine here since nothing follows it.
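To see why the \g<1> form matters, here is a small sketch with plain re.sub (again standing in for the pandas call). It assumes nothing beyond the standard library; the names fixed and ambiguous_failed are made up for the illustration.

```python
import re

s = "4/3/09"
pattern = r'([/ ])(\d\d)\b'

# \g<1> closes the group reference explicitly, so "19" is literal text.
fixed = re.sub(pattern, r'\g<1>19\2', s)

# \1 written directly before digits is read as a higher-numbered group,
# which this pattern does not define, so re rejects the template.
try:
    re.sub(pattern, r'\119\2', s)
    ambiguous_failed = False
except re.error:
    ambiguous_failed = True

print(fixed)             # 4/3/1909
print(ambiguous_failed)  # True
```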

See re.sub Python documentation.

EDIT

After you added more data, it seems you need to only match the two digits at the end of the string.

Use

df['dates'] = df['dates'].str.replace(r'([/ ])(\d\d)$', r'\g<1>19\2', regex=True)
# or
df['dates'] = df['dates'].str.replace(r'(?<=[/ ])(?=\d\d$)', '19', regex=True)

The second call avoids the problem with backreferences since it uses lookarounds: it matches the empty position between the / or space and the final two digits, and inserts 19 there.
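As a quick check, both end-anchored variants can be run over a few entries from the question's date_set. This is a sketch using re.sub from the standard library rather than pandas; the list dates is a hand-picked subset, and the last line shows what the earlier \b version does to a date with a two-digit day in the middle.

```python
import re

# A subset of the question's date_set, chosen for illustration.
dates = ['04/20/2009', '04/20/09', '4/20/09', '4/3/09',
         'Mar 20 2009', '20 Mar 2009', '6/2008', '2009']

# Variant 1: capture the separator and the two trailing digits.
with_groups = [re.sub(r'([/ ])(\d\d)$', r'\g<1>19\2', d) for d in dates]

# Variant 2: match only the empty position and insert "19" there.
with_lookarounds = [re.sub(r'(?<=[/ ])(?=\d\d$)', '19', d) for d in dates]

# The \b version also fires on a two-digit day in the middle of the string.
word_boundary = re.sub(r'([/ ])(\d\d)\b', r'\g<1>19\2', 'Mar 20 2009')

print(with_groups)
print(with_groups == with_lookarounds)  # True
print(word_boundary)                    # Mar 1920 2009
```

Only the three entries that end in a two-digit year change; everything else in the subset is left untouched.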
