Matt Mateo

Reputation: 168

Extracting dates using regex following several formats

(?:\d{1,2}[\-\/])?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec|January|February|March|April|May|June|July|August|September|October|November|December)?[\,\.\s]*(?:\d{1,2}[\-\/\.)\s,]*)+(?:\d{2,4})(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec|January|February|March|April|May|June|July|August|September|October|November|December)?[\,\.\s]*(?:\d{1,2}[\-\/\.),]*)

I am trying to extract dates from text using the regex above, which is meant to cover several formats.

The problem occurs when it tries to extract year-first dates such as 2020 JAN. 1, 2020 JAN. 01, 2020 Jan. 01, and 2020-01-01.

Upvotes: 0

Views: 39

Answers (1)

Wiktor Stribiżew

Reputation: 626932

You can use

pattern = r"""(?ix)
  \b(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?) [\s.]* (?:0?[1-9]|[12][0-9]|3[01]) [\s,.]* (?:19|20)(?:\d{2})? # Jan 01 2000
|
  (?<!\d)(?:19|20)(?:\d{2})? [\s,.]* (?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?) [\s.]* (?:0?[1-9]|[12][0-9]|3[01])(?!\d) # 2000 Jan 01
|
 (?<!\d)
   (?:
    (?:0?[1-9]|1[012])[-/.]?(?:0?[1-9]|[12][0-9]|3[01])[-/.]?(?:19|20)\d\d # MM/dd/yyyy
     |
    (?:19|20)\d\d[-/.]?(?:0?[1-9]|1[012])[-/.]?(?:0?[1-9]|[12][0-9]|3[01]) # yyyy/MM/dd
   )
 (?!\d)"""

See the regex demo

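The i inline flag enables case-insensitive matching, and x enables verbose mode (re.VERBOSE), in which unescaped whitespace in the pattern is ignored and # starts a comment that runs to the end of the line.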
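
As a rough usage sketch, the pattern can be compiled once and applied with re.findall. The names DATE_RE and extract_dates and the sample sentence are illustrative assumptions, not part of the question. The pattern follows the one in the answer; note the (?!\d) after the day in the year-first branch: without it, 2020 Jan. 15 would match only as 2020 Jan. 1.

```python
import re

# Pattern following the answer. (?i) = case-insensitive, (?x) = verbose mode,
# so the spaces and the trailing "# ..." comments inside the pattern are ignored.
# The (?!\d) after the day in the second branch keeps two-digit days whole.
DATE_RE = re.compile(r"""(?ix)
  \b(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?) [\s.]* (?:0?[1-9]|[12][0-9]|3[01]) [\s,.]* (?:19|20)(?:\d{2})? # Jan 01 2000
|
  (?<!\d)(?:19|20)(?:\d{2})? [\s,.]* (?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?) [\s.]* (?:0?[1-9]|[12][0-9]|3[01])(?!\d) # 2000 Jan 01
|
 (?<!\d)
   (?:
    (?:0?[1-9]|1[012])[-/.]?(?:0?[1-9]|[12][0-9]|3[01])[-/.]?(?:19|20)\d\d # MM/dd/yyyy
     |
    (?:19|20)\d\d[-/.]?(?:0?[1-9]|1[012])[-/.]?(?:0?[1-9]|[12][0-9]|3[01]) # yyyy/MM/dd
   )
 (?!\d)""")


def extract_dates(text):
    """Return every date-like substring of text, in order of appearance."""
    # The pattern has no capturing groups, so findall returns whole matches.
    return DATE_RE.findall(text)


if __name__ == "__main__":
    # Hypothetical sample built from the formats mentioned in the question.
    sample = "Filed 2020 JAN. 1, revised 2020 Jan. 15, shipped 2020-01-01, closed Jan 01 2000."
    print(extract_dates(sample))
    # ['2020 JAN. 1', '2020 Jan. 15', '2020-01-01', 'Jan 01 2000']
```

Compiling once with re.compile is only a convenience here; passing the same string directly to re.findall should behave the same way.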

Upvotes: 1
