shan

Reputation: 477

Python: how to extract a date using regex

I would like to extract only dates that are in the specific format "Month day, year". If a date is in any other format, I will skip it. I used the regex below, but only the month is returned, not the day and year. Can someone point out what is wrong?

import re

date_pattern = r"(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May?|June?|July?|Aug(?:ust)?|Sep(?:t(?:ember)?)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?\s+\d{2},\s+\d{4})"
s = "the date is November 15, 2009"
print(re.findall(date_pattern, s))

Output expected : November 15, 2009

Output of the above code : "November"

Upvotes: 4

Views: 608

Answers (3)

kerwei

Reputation: 1842

You missed out the closing parenthesis in your regex pattern. It should come after December to complete the non-capturing group.

(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June|July|Aug(?:ust)?|Sep(?:t(?:ember)?)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)\s+\d{2},\s+\d{4}

Edit: Actually, it's the positioning of your parenthesis that is incorrect. Instead of being at the end of the pattern, it should come after the December alternative because that is your non-capturing group for month names.
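A minimal check of the corrected pattern, run against the sample string from the question. The pattern is split across several raw-string literals only for readability; the variable names are the ones used in the question.

```python
import re

# The month alternatives are closed off in their own non-capturing group,
# so the day and year parts apply to every month, not only to December.
date_pattern = (
    r"(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June|July|"
    r"Aug(?:ust)?|Sep(?:t(?:ember)?)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)"
    r"\s+\d{2},\s+\d{4}"
)

s = "the date is November 15, 2009"
matches = re.findall(date_pattern, s)
print(matches)  # ['November 15, 2009']
```

Because the pattern contains no capturing groups, re.findall returns the full matched text for each hit.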

Upvotes: 1

Allan

Reputation: 12438

You can change the regex into:

(?:(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May?|June?|July?|Aug(?:ust)?|Sep(?:t(?:ember)?)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)\s+\d{2},\s+\d{4})

Explanations:

Your current regex accepts the pattern detailed here:

Demo: https://regex101.com/r/0teiAB/3

If you do not add the parenthesis, the regex accepts either one of the month names on its own, or Dec(?:ember)?\s+\d{2},\s+\d{4} (Dec/December followed by the day and year).

Demo: https://regex101.com/r/0teiAB/1
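The effect of the misplaced parenthesis can be reproduced with a short sketch. It uses the question's pattern unchanged and adds a second, hypothetical test string containing a December date, which is the only alternative that carries the day and year.

```python
import re

# The question's pattern, unchanged: the closing parenthesis sits at the very
# end, so "\s+\d{2},\s+\d{4}" belongs to the Dec(?:ember)? alternative only.
original = (
    r"(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May?|June?|July?|"
    r"Aug(?:ust)?|Sep(?:t(?:ember)?)?|Oct(?:ober)?|Nov(?:ember)?|"
    r"Dec(?:ember)?\s+\d{2},\s+\d{4})"
)

nov = re.findall(original, "the date is November 15, 2009")
dec = re.findall(original, "the date is December 15, 2009")
print(nov)  # ['November']
print(dec)  # ['December 15, 2009']
```

Alternation (|) has the lowest precedence in a regex, so everything after the last | up to the closing parenthesis is treated as one alternative.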

Additional notes:

  • For the days, \d{2} also accepts 33, 99, and 00, which are not valid calendar days. You can replace this part with (?:0?[1-9]|[1-2][0-9]|30|31) to limit the range, as shown in:

Demo: https://regex101.com/r/NTIyf7/1

  • This is not enough if you want to limit the maximum day per month (there is no February 31, for example). For that level of precision, you will need to change the regex and use an expression similar to the one introduced above for each month.

  • Last but not least, if you go even further and want to handle leap years (February 29), regex might not be the proper tool, and you will have to use a date/calendar library to verify whether your date is valid.
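One way to do the calendar check from the last note is to hand each regex match to datetime.strptime, which rejects impossible dates. This is a sketch under assumptions: is_real_date is a hypothetical helper name, and the %B / %b directives rely on an English locale, so "Sept" would not be recognised as an abbreviation.

```python
from datetime import datetime


def is_real_date(text):
    """Return True if text such as 'February 29, 2008' is a valid calendar date."""
    # Try the full month name first, then the three-letter abbreviation.
    for fmt in ("%B %d, %Y", "%b %d, %Y"):
        try:
            datetime.strptime(text, fmt)
            return True
        except ValueError:
            continue
    return False


print(is_real_date("November 15, 2009"))  # True
print(is_real_date("February 29, 2008"))  # True  (2008 is a leap year)
print(is_real_date("February 29, 2009"))  # False
print(is_real_date("November 31, 2009"))  # False
```

With this split, the regex only has to locate candidates in the text, and the per-month and leap-year rules are left to the standard library.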

Upvotes: 1

U13-Forward

Reputation: 71570

Or use re.search with group(0):

>>> import re
>>> date_pattern = r'(Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?|Aug(ust)?|Sep(tember)?|Oct(ober)?|Nov(ember)?|Dec(ember)?)\s+\d{1,2},\s+\d{4}'
>>> s = "the date is November 15, 2009"
>>> re.search(date_pattern, s).group(0)
'November 15, 2009'

Visit the regex101 I created for it.
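A short sketch of why group(0) matters with this pattern: it uses capturing groups, and re.findall returns the captured groups rather than the full match whenever a pattern has any. Using re.finditer to collect several dates is an addition here, not part of the answer above.

```python
import re

date_pattern = (
    r"(Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?|Aug(ust)?|"
    r"Sep(tember)?|Oct(ober)?|Nov(ember)?|Dec(ember)?)\s+\d{1,2},\s+\d{4}"
)
s = "the date is November 15, 2009"

# With capturing groups, findall returns one tuple of groups per match;
# the first element is the month, and the day and year are not included.
groups = re.findall(date_pattern, s)
print(groups[0][0])  # November

# finditer yields match objects, so group(0) gives each full match.
full = [m.group(0) for m in re.finditer(date_pattern, s)]
print(full)  # ['November 15, 2009']
```

Note that re.search returns None when nothing matches, so calling .group(0) directly raises AttributeError on text without a date; checking the result first avoids that.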

Upvotes: 2
