Reputation: 1291
I have got some dates included within the test in one of the columns in my dataframe. for example,
sr = pd.Series(['04/20/2009', '04/20/09', '4/20/09', '4/3/09', '6/2008','12/2009','2010'])
I want to extract these dates.. and half of my year ends up in the 'month' and 'day' columns.
result = sr.str.extractall(r'(?P<month>\d{,2})[/]?(?P<day>\d{,2})[/]?(?P<year>\d{2,4})')
result
month day year
match
0 0 04 20 2009
1 0 04 20 09
2 0 4 20 09
3 0 4 3 09
4 0 6 20 08
5 0 12 20 09
6 0 20 NaN 10
how can I fix this?
I can only think of processing "'6/2008','12/2009','2010'" separately from "'04/20/2009', '04/20/09', '4/20/09'", and then appending them.
Upvotes: 1
Views: 98
Reputation: 163352
You could make the match a bit more specific for the months and days.
As there is always a year, you can make the whole group for the month and day optional.
In that optional group, you can match a month with an optional day.
(?<!\S)(?:(?P<month>1[0-2]|0?[1-9])/(?:(?P<day>3[01]|[12][0-9]|0?[1-9])/)?)?(?P<year>(?:20|19)?\d{2})(?!\S)
In parts
(?<!\S)
Negative lookbehind, assert what is directly to the left is not a non whitespace char (whitespace boundary to the left)(?:
Non capture group
(?P<month>1[0-2]|0?[1-9])/
Group month followed by /
(?:
Non capture group
(?P<day>3[01]|[12][0-9]|0?[1-9])/
Group day followed by /
)?
Close group and make it optional)?
Close group and make it optional(?P<year>(?:20|19)?\d{2})
Group year, optionally match either 20 or 19 and 2 digits(?!\S)
Negative lookahead, assert not a non whitespace char directly to the right (whitespace boundary to the right)Upvotes: 6