Bluetail
Bluetail

Reputation: 1291

Extractall dates - how to separate single years with RegEx in python?

I have got some dates included within the test in one of the columns in my dataframe. for example,

sr = pd.Series(['04/20/2009', '04/20/09', '4/20/09', '4/3/09', '6/2008','12/2009','2010'])

I want to extract these dates.. and half of my year ends up in the 'month' and 'day' columns.

result = sr.str.extractall(r'(?P<month>\d{,2})[/]?(?P<day>\d{,2})[/]?(?P<year>\d{2,4})')
result

      month day year
  match         
0   0   04  20  2009
1   0   04  20  09
2   0   4   20  09
3   0   4   3   09
4   0   6   20  08
5   0   12  20  09
6   0   20  NaN 10

how can I fix this?

I can only think of processing "'6/2008','12/2009','2010'" separately from "'04/20/2009', '04/20/09', '4/20/09'", and then appending them.

Upvotes: 1

Views: 98

Answers (1)

The fourth bird
The fourth bird

Reputation: 163352

You could make the match a bit more specific for the months and days.

As there is always a year, you can make the whole group for the month and day optional.

In that optional group, you can match a month with an optional day.

(?<!\S)(?:(?P<month>1[0-2]|0?[1-9])/(?:(?P<day>3[01]|[12][0-9]|0?[1-9])/)?)?(?P<year>(?:20|19)?\d{2})(?!\S)

In parts

  • (?<!\S) Negative lookbehind, assert what is directly to the left is not a non whitespace char (whitespace boundary to the left)
  • (?: Non capture group
    • (?P<month>1[0-2]|0?[1-9])/ Group month followed by /
    • (?: Non capture group
      • (?P<day>3[01]|[12][0-9]|0?[1-9])/ Group day followed by /
    • )? Close group and make it optional
  • )? Close group and make it optional
  • (?P<year>(?:20|19)?\d{2}) Group year, optionally match either 20 or 19 and 2 digits
  • (?!\S) Negative lookahead, assert not a non whitespace char directly to the right (whitespace boundary to the right)

Regex demo

Upvotes: 6

Related Questions