Sam

Reputation: 2605

Python regex to handle different types of dates

I am trying to write a regex to identify some dates.

The string I am working on is:

string:
'these are just rubbish 11-2-2222, 24-3-1695-194475 12-13-1111, 32/11/2000\
 these are dates 4-02-2011, 12/12/1990, 31-11-1690,  11 July 1990, 7 Oct 2012\
 these are actual deal- by 12 December six people died and in June 2000 he told, by 5 July 2001, he will leave.'

The regex looks like this:

re.findall('(\
[\b, ]\
([1-9]|0[1-9]|[12][0-9]|3[01])\
[-/.\s+]\
(1[1-2]|0[1-9]|[1-9]|Jan|January|Feb|February|Mar|March|Apr|April|May|Jun|June|Jul|July|Aug|August|Sept|September|Oct|October|Nov|November|Dec|December)\
(?:[-/.\s+](1[0-9]\d\d|20[0-2][0-5]))?\
[^\da-zA-Z])',String)

The output I get is:

[(' 11-2-', '11', '2', ''),
 (' 24-3-1695-', '24', '3', '1695'),
 (' 4-02-2011,', '4', '02', '2011'),
 (' 12/12/1990,', '12', '12', '1990'),
 (' 31-11-1690,', '31', '11', '1690'),
 (' 11 July 1990,', '11', 'July', '1990'),
 (' 7 Oct 2012 ', '7', 'Oct', '2012'),
 (' 12 December ', '12', 'December', ''),
 (' 5 July 2001,', '5', 'July', '2001')]

Problems:

  1. The first two outputs are wrong. They appear because of the optional group ((?:[-/.\s+](1[0-9]\d\d|20[0-2][0-5]))?) that I added to handle cases like "12 December". How do I get rid of them?

  2. The case "June 2000" is not handled by the expression.
    Can I add something to the expression to handle this case without affecting the others?

Upvotes: 4

Views: 1093

Answers (3)

Matt

Reputation: 1284

@Martin Evans's answer was great, but I also wanted to return the location of each match within the string:

>>> text = """these are just rubbish 11-2-2222, 24-3-1695-194475 12-13-1111, 32/11/2000
these are dates 4-02-2011, 12/12/1990, 31-11-1690,  11 July 1990, 7 Oct 2012
these are actual deal- by 12 December six people died and in June 2000 he told, by 5 July 2001, he will leave."""

>>> find_dates(text)

[('2011-02-04', 90, 99, '4-02-2011'),
 ('1990-12-12', 101, 111, '12/12/1990'),
 ('1990-07-11', 126, 138, '11 July 1990'),
 ('2012-10-07', 140, 150, '7 Oct 2012'),
 ('2022-12-12', 177, 192, '12 December six'),
 ('2000-06-01', 212, 224, 'June 2000 he'),
 ('2001-07-05', 234, 245, '5 July 2001')]

I have wrapped it up in a function that uses finditer instead of findall:

from itertools import tee
from datetime import datetime
import re

def find_dates(
    text,
    valid_from = datetime(1920, 1, 1),
    valid_to = datetime(2030, 1, 1),
    default_year = datetime.now().year,
    dt_formats = [
        ['%d', '%m', '%Y'], 
        ['%d', '%b', '%Y'],
        ['%d', '%B', '%Y'],
        ['%d', '%b'],
        ['%d', '%B'],
        ['%b', '%d'],
        ['%B', '%d'],
        ['%b', '%Y'],
        ['%B', '%Y'],
    ],
    ):
    # store your matches here
    dates = []
        
    t1, t2, t3 = tee(list(re.finditer(r'\b\w+\b', text)), 3)
    next(t2, None)
    next(t3, None)
    next(t3, None)
    triples = zip(t1, t2, t3)

    for triple in triples:
        # get start and end index of each triple
        start = triple[0].start()
        end = triple[-1].end()

        # convert matches to a list of three strings
        triple = [text[t.start():t.end()] for t in triple]

        for dt_format in dt_formats:
            try:
                dt = datetime.strptime(' '.join(triple[:len(dt_format)]), ' '.join(dt_format))

                if '%Y' not in dt_format:
                    dt = dt.replace(year=default_year)

                if valid_from <= dt <= valid_to:
                    dates.append((dt.strftime('%Y-%m-%d'), start, end, text[start:end]))

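                    # skip the triples that start inside the date just matched;
                    # the None default avoids StopIteration at the end of the text
                    for _ in range(1, len(dt_format)):
                        next(triples, None)
                break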

            except ValueError:
                pass
            
    return dates

There is still a bug, as you can see in ('2000-06-01', 212, 224, 'June 2000 he'): the end index is always taken from the third word of the triple, even when the matching format consumed only two words. A better approach may be to use dateutil.parser.parse, as in https://stackoverflow.com/a/33051237/5125264
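One way to avoid the overlong span is to take the end index from the last word the format actually consumed. The following is only a sketch under that assumption: the name find_date_spans and the index-based loop are illustrative and not part of either answer, and month names are assumed to be parsed in an English locale.

```python
import re
from datetime import datetime

# Same word-level formats as in the answers on this page.
DT_FORMATS = [
    ['%d', '%m', '%Y'], ['%d', '%b', '%Y'], ['%d', '%B', '%Y'],
    ['%d', '%b'], ['%d', '%B'], ['%b', '%d'], ['%B', '%d'],
    ['%b', '%Y'], ['%B', '%Y'],
]

def find_date_spans(text, default_year=2018,
                    valid_from=datetime(1920, 1, 1),
                    valid_to=datetime(2030, 1, 1)):
    """Hypothetical helper: return (iso_date, start, end, matched_text) tuples."""
    words = list(re.finditer(r'\b\w+\b', text))
    dates = []
    i = 0
    while i < len(words):
        step = 1  # by default, move on to the next word
        for fmt in DT_FORMATS:
            parts = words[i:i + len(fmt)]
            if len(parts) < len(fmt):
                continue  # not enough words left for this format
            try:
                dt = datetime.strptime(' '.join(m.group() for m in parts),
                                       ' '.join(fmt))
            except ValueError:
                continue  # this format does not fit, try the next one
            if '%Y' not in fmt:
                dt = dt.replace(year=default_year)
            if valid_from <= dt <= valid_to:
                # span covers only the words this format consumed
                start, end = parts[0].start(), parts[-1].end()
                dates.append((dt.strftime('%Y-%m-%d'), start, end, text[start:end]))
                step = len(fmt)  # skip past the matched date
            break  # first format that parses wins, as in the original loop
        i += step
    return dates
```

Because the loop slices the word list instead of zipping three iterators, a two-word date at the very end of the text (for example "leaving on 5 July") is also considered, which the triple-based version would not reach.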

Upvotes: 1

Rakshith N

Reputation: 15

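Use this pattern: r'\d{,2}-[A-Za-z]{,9}-\d{,4}'

import re
re.match(r'\d{,2}\-[A-Za-z]{,9}\-\d{,4}','Your Date')

This matches dates in formats such as '14-Jun-2021' and '4-september-20'. Note that re.match only matches at the start of the string, and that every quantifier here also allows zero characters.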
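Since {,2}, {,9} and {,4} all accept zero repetitions, the pattern above also matches a bare '--'. A stricter variant might look like the sketch below; DATE_RE and its bounds (1-2 digit day, 3-9 letter month, 2- or 4-digit year) are assumptions for illustration, and re.search is used so the date can appear anywhere in the string.

```python
import re

# The pattern from this answer: every part is optional, so '--' matches.
LOOSE = r'\d{,2}-[A-Za-z]{,9}-\d{,4}'
print(bool(re.match(LOOSE, '--')))

# Hypothetical stricter variant: each part must be present.
DATE_RE = re.compile(r'\b\d{1,2}-[A-Za-z]{3,9}-(?:\d{4}|\d{2})\b')

for sample in ['due 14-Jun-2021', '4-september-20', '--']:
    m = DATE_RE.search(sample)
    print(repr(sample), '->', m.group() if m else None)
```

This still does not check that the month name is real or that the day exists in that month; it only tightens the shape of the match.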

Upvotes: 0

Martin Evans

Reputation: 46759

I would avoid trying to get a regular expression to parse your dates. As you have found, it starts out fine, but edge cases soon become hard to catch, for example invalid dates such as 31/09/2018.

A safer approach is to let Python's datetime decide if a date is valid or not. You can then easily specify valid date ranges and allowed date formats.
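As a small illustration of that idea, datetime.strptime raises ValueError for a date that does not exist, so the validity check comes for free. The helper name parse_in_range and its default bounds are assumptions made for this sketch.

```python
from datetime import datetime

def parse_in_range(s, fmt='%d/%m/%Y',
                   valid_from=datetime(1920, 1, 1),
                   valid_to=datetime(2030, 1, 1)):
    """Hypothetical helper: return a datetime, or None if invalid or out of range."""
    try:
        dt = datetime.strptime(s, fmt)
    except ValueError:
        return None  # wrong shape, or a day that does not exist
    return dt if valid_from <= dt <= valid_to else None

print(parse_in_range('31/09/2018'))  # None: September has only 30 days
print(parse_in_range('30/09/2018'))  # 2018-09-30 00:00:00
print(parse_in_range('11/02/2222'))  # None: outside the allowed range
```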

This script works by using the regular expression to extract all words and number groups. It then takes three parts at a time and applies the allowed date formats. If datetime succeeds in parsing a given format, it is tested to ensure it falls within your allowed date ranges. If valid, the matching parts are skipped over to avoid a second match on a partial date.
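The "three parts at a time" step can be seen in isolation with a short sketch. The function name sliding_triples is only illustrative; it uses the same tee and zip idiom as the script.

```python
from itertools import tee

def sliding_triples(items):
    """Yield overlapping (a, b, c) windows over items."""
    t1, t2, t3 = tee(items, 3)
    next(t2, None)   # second iterator starts one item ahead
    next(t3, None)   # third iterator starts two items ahead
    next(t3, None)
    return zip(t1, t2, t3)

words = ['by', '5', 'July', '2001', 'he']
print(list(sliding_triples(words)))
# [('by', '5', 'July'), ('5', 'July', '2001'), ('July', '2001', 'he')]
```

Calling next() on the zipped iterator inside the loop is what skips the windows that begin in the middle of a date that was already matched.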

If the date found does not contain a year, a default_year is assumed:

from itertools import tee
from datetime import datetime
import re


valid_from = datetime(1920, 1, 1)
valid_to = datetime(2030, 1, 1)
default_year = 2018

dt_formats = [
    ['%d', '%m', '%Y'], 
    ['%d', '%b', '%Y'],
    ['%d', '%B', '%Y'],
    ['%d', '%b'],
    ['%d', '%B'],
    ['%b', '%d'],
    ['%B', '%d'],
    ['%b', '%Y'],
    ['%B', '%Y'],
]

text = """these are just rubbish 11-2-2222, 24-3-1695-194475 12-13-1111, 32/11/2000
these are dates 4-02-2011, 12/12/1990, 31-11-1690,  11 July 1990, 7 Oct 2012
these are actual deal- by 12 December six people died and in June 2000 he told, by 5 July 2001, he will leave."""

t1, t2, t3 = tee(re.findall(r'\b\w+\b', text), 3)
next(t2, None)
next(t3, None)
next(t3, None)
triples = zip(t1, t2, t3)

for triple in triples:
    for dt_format in dt_formats:
        try:
            dt = datetime.strptime(' '.join(triple[:len(dt_format)]), ' '.join(dt_format))

            if '%Y' not in dt_format:
                dt = dt.replace(year=default_year)

            if valid_from <= dt <= valid_to:
                print(dt.strftime('%d-%m-%Y'))

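                # skip the triples that start inside the date just matched;
                # the None default avoids StopIteration at the end of the text
                for _ in range(1, len(dt_format)):
                    next(triples, None)
            break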

        except ValueError:
            pass

For the text you have given, this would display:

04-02-2011
12-12-1990
11-07-1990
07-10-2012
12-12-2018
01-06-2000
05-07-2001

Upvotes: 2
