Reputation: 2605
I am trying to write a regex to identify some dates.
The string I am working on is:
'these are just rubbish 11-2-2222, 24-3-1695-194475 12-13-1111, 32/11/2000\
these are dates 4-02-2011, 12/12/1990, 31-11-1690, 11 July 1990, 7 Oct 2012\
these are actual deal- by 12 December six people died and in June 2000 he told, by 5 July 2001, he will leave.'
The regex looks like this:
re.findall('(\
[\b, ]\
([1-9]|0[1-9]|[12][0-9]|3[01])\
[-/.\s+]\
(1[1-2]|0[1-9]|[1-9]|Jan|January|Feb|February|Mar|March|Apr|April|May|Jun|June|Jul|July|Aug|August|Sept|September|Oct|October|Nov|November|Dec|December)\
(?:[-/.\s+](1[0-9]\d\d|20[0-2][0-5]))?\
[^\da-zA-Z])',String)
The output I get is:
[(' 11-2-', '11', '2', ''),
(' 24-3-1695-', '24', '3', '1695'),
(' 4-02-2011,', '4', '02', '2011'),
(' 12/12/1990,', '12', '12', '1990'),
(' 31-11-1690,', '31', '11', '1690'),
(' 11 July 1990,', '11', 'July', '1990'),
(' 7 Oct 2012 ', '7', 'Oct', '2012'),
(' 12 December ', '12', 'December', ''),
(' 5 July 2001,', '5', 'July', '2001')]
Problems:
The first two results are wrong. They appear because of the optional group ((?:[-/.\s+](1[0-9]\d\d|20[0-2][0-5]))?) that I added to handle cases like "12 December". How do I get rid of them?
The case "June 2000" is not handled by the expression. Can I add something to the expression that handles this case without affecting the others?
Upvotes: 4
Views: 1093
Reputation: 1284
Martin Evans's answer was great, but I also wanted to return the location of each match within the string:
>>> text = """these are just rubbish 11-2-2222, 24-3-1695-194475 12-13-1111, 32/11/2000
these are dates 4-02-2011, 12/12/1990, 31-11-1690, 11 July 1990, 7 Oct 2012
these are actual deal- by 12 December six people died and in June 2000 he told, by 5 July 2001, he will leave."""
>>> find_dates(text)
[('2011-02-04', 90, 99, '4-02-2011'),
('1990-12-12', 101, 111, '12/12/1990'),
('1990-07-11', 126, 138, '11 July 1990'),
('2012-10-07', 140, 150, '7 Oct 2012'),
('2022-12-12', 177, 192, '12 December six'),
('2000-06-01', 212, 224, 'June 2000 he'),
('2001-07-05', 234, 245, '5 July 2001')]
I have wrapped it up in a function that uses finditer instead of findall:
from itertools import tee
from datetime import datetime
import re

def find_dates(
    text,
    valid_from = datetime(1920, 1, 1),
    valid_to = datetime(2030, 1, 1),
    default_year = datetime.now().year,
    dt_formats = [
        ['%d', '%m', '%Y'],
        ['%d', '%b', '%Y'],
        ['%d', '%B', '%Y'],
        ['%d', '%b'],
        ['%d', '%B'],
        ['%b', '%d'],
        ['%B', '%d'],
        ['%b', '%Y'],
        ['%B', '%Y'],
    ],
):
    # store your matches here
    dates = []
    t1, t2, t3 = tee(list(re.finditer(r'\b\w+\b', text)), 3)
    next(t2, None)
    next(t3, None)
    next(t3, None)
    triples = zip(t1, t2, t3)
    for triple in triples:
        # get start and end index of each triple
        start = triple[0].start()
        end = triple[-1].end()
        # convert matches to a list of three strings
        triple = [text[t.start():t.end()] for t in triple]
        for dt_format in dt_formats:
            try:
                dt = datetime.strptime(' '.join(triple[:len(dt_format)]), ' '.join(dt_format))
                if '%Y' not in dt_format:
                    dt = dt.replace(year=default_year)
                if valid_from <= dt <= valid_to:
                    dates.append((dt.strftime('%Y-%m-%d'), start, end, text[start:end]))
                    # skip the windows that overlap this date; the None default
                    # avoids a StopIteration when the date is at the end of the text
                    for skip in range(1, len(dt_format)):
                        next(triples, None)
                    break
            except ValueError:
                pass
    return dates
There is still a bug, though: the reported span always covers all three tokens of the window, even when the date uses only two of them, as you can see in ('2000-06-01', 212, 224, 'June 2000 he'). A better approach may be to do something with dateutil.parser.parse, as in https://stackoverflow.com/a/33051237/5125264
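One possible way to avoid the trailing word is to take the end offset from the last token the format actually consumed. The following is only a sketch under some assumptions: `find_dates_trimmed` is a hypothetical name, it walks the token list by index instead of using tee, and the formats without a year are left out to keep it short.

```python
import re
from datetime import datetime

# Formats that include a year; the yearless ones are omitted in this sketch.
DT_FORMATS = (
    ('%d', '%m', '%Y'), ('%d', '%b', '%Y'), ('%d', '%B', '%Y'),
    ('%b', '%Y'), ('%B', '%Y'),
)

def find_dates_trimmed(text, valid_from=datetime(1920, 1, 1),
                       valid_to=datetime(2030, 1, 1)):
    """Hypothetical variant: the span covers only the tokens that were parsed."""
    tokens = list(re.finditer(r'\b\w+\b', text))
    dates = []
    i = 0
    while i < len(tokens):
        window = tokens[i:i + 3]
        for fmt in DT_FORMATS:
            used = window[:len(fmt)]
            if len(used) < len(fmt):
                continue  # not enough tokens left for this format
            try:
                dt = datetime.strptime(' '.join(m.group() for m in used),
                                       ' '.join(fmt))
            except ValueError:
                continue
            if valid_from <= dt <= valid_to:
                # end comes from the last token used, not the last in the window
                start, end = used[0].start(), used[-1].end()
                dates.append((dt.strftime('%Y-%m-%d'), start, end, text[start:end]))
                i += len(fmt) - 1  # skip the tokens that belong to this date
                break
        i += 1
    return dates

sample = "in June 2000 he told, by 5 July 2001, he will leave."
for found in find_dates_trimmed(sample):
    print(found)
```

With this change the matched text for the sample is 'June 2000' and '5 July 2001', without the following word.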
Upvotes: 1
Reputation: 15
Use this: r'\d{,2}-[A-Za-z]{,9}-\d{,4}'
import re
re.match(r'\d{,2}\-[A-Za-z]{,9}\-\d{,4}','Your Date')
This matches dates in formats such as '14-Jun-2021' and '4-september-20'.
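Two things are worth knowing when using this pattern: re.match only tests the start of the string, and quantifiers such as {,2} also accept zero characters, so the string '--' matches as well. A hedged variant for scanning a longer text is sketched below; the lower bounds and the \b word boundaries are assumptions added here, not part of the pattern above.

```python
import re

# Variant of the pattern above with lower bounds and word boundaries (assumed).
pattern = re.compile(r'\b\d{1,2}-[A-Za-z]{3,9}-\d{2,4}\b')

sample = "due 14-Jun-2021, revised 4-september-20, not 11-2-2222"
found = pattern.findall(sample)  # findall scans the whole string
print(found)  # ['14-Jun-2021', '4-september-20']
```

The pattern still does not check that the letters spell a real month or that the day exists, so a validation step is needed afterwards.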
Upvotes: 0
Reputation: 46759
I would avoid trying to get a regular expression to parse your dates. As you have found, it starts out OK, but edge cases soon become hard to catch, for example invalid dates such as 31/09/2018.
A safer approach is to let Python's datetime decide if a date is valid or not. You can then easily specify valid date ranges and allowed date formats.
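As a small illustration of letting datetime do the validation, strptime raises ValueError for a date that does not exist. The helper name `is_valid_date` is hypothetical and used only for this example:

```python
from datetime import datetime

def is_valid_date(text, fmt="%d/%m/%Y"):
    """Hypothetical helper: True if text parses as a real calendar date."""
    try:
        datetime.strptime(text, fmt)
        return True
    except ValueError:
        return False

print(is_valid_date("31/09/2018"))  # False, September has 30 days
print(is_valid_date("30/09/2018"))  # True
```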
This script works by using a regular expression to extract all words and number groups. It then takes three parts at a time and tries each of the allowed date formats. If datetime succeeds in parsing a given format, the result is tested to ensure it falls within your allowed date range. If valid, the matching parts are skipped over to avoid a second match on a partial date.
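The "three parts at a time" step is a sliding window built from three offset copies of one iterator. A short illustration, where `sliding_triples` is a hypothetical name used only here:

```python
from itertools import tee

def sliding_triples(tokens):
    """Yield overlapping windows of three consecutive tokens."""
    t1, t2, t3 = tee(tokens, 3)
    next(t2, None)   # second copy starts one token later
    next(t3, None)   # third copy starts two tokens later
    next(t3, None)
    return zip(t1, t2, t3)

windows = list(sliding_triples(["by", "5", "July", "2001", "he"]))
print(windows)
# [('by', '5', 'July'), ('5', 'July', '2001'), ('July', '2001', 'he')]
```

Calling next() on the zipped iterator inside the loop is what skips the windows that overlap a date already found.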
If the date found does not contain a year, default_year is assumed:
from itertools import tee
from datetime import datetime
import re

valid_from = datetime(1920, 1, 1)
valid_to = datetime(2030, 1, 1)
default_year = 2018

dt_formats = [
    ['%d', '%m', '%Y'],
    ['%d', '%b', '%Y'],
    ['%d', '%B', '%Y'],
    ['%d', '%b'],
    ['%d', '%B'],
    ['%b', '%d'],
    ['%B', '%d'],
    ['%b', '%Y'],
    ['%B', '%Y'],
]

text = """these are just rubbish 11-2-2222, 24-3-1695-194475 12-13-1111, 32/11/2000
these are dates 4-02-2011, 12/12/1990, 31-11-1690, 11 July 1990, 7 Oct 2012
these are actual deal- by 12 December six people died and in June 2000 he told, by 5 July 2001, he will leave."""

# build a sliding window of three consecutive words/number groups
t1, t2, t3 = tee(re.findall(r'\b\w+\b', text), 3)
next(t2, None)
next(t3, None)
next(t3, None)
triples = zip(t1, t2, t3)

for triple in triples:
    for dt_format in dt_formats:
        try:
            dt = datetime.strptime(' '.join(triple[:len(dt_format)]), ' '.join(dt_format))
            if '%Y' not in dt_format:
                dt = dt.replace(year=default_year)
            if valid_from <= dt <= valid_to:
                print(dt.strftime('%d-%m-%Y'))
                # skip the windows that overlap this date; the None default
                # avoids a StopIteration when the date is at the end of the text
                for skip in range(1, len(dt_format)):
                    next(triples, None)
                break
        except ValueError:
            pass
For the text you have given, this would display:
04-02-2011
12-12-1990
11-07-1990
07-10-2012
12-12-2018
01-06-2000
05-07-2001
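The default year could also be applied before parsing instead of after it. When no year is given, strptime assumes 1900, which is not a leap year, so "29 Feb" is rejected before the replace step is ever reached. A minimal sketch, assuming a hypothetical helper `parse_without_year` that is not part of the script above:

```python
from datetime import datetime

def parse_without_year(day_month, fmt, default_year):
    """Hypothetical helper: parse e.g. '12 December' as if default_year were written after it."""
    return datetime.strptime(f"{day_month} {default_year}", f"{fmt} %Y")

print(parse_without_year("12 December", "%d %B", 2018).strftime("%d-%m-%Y"))  # 12-12-2018
print(parse_without_year("29 Feb", "%d %b", 2020).strftime("%d-%m-%Y"))       # 29-02-2020
```

Because the day is checked against the chosen year, 29 February is accepted only when default_year is a leap year.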
Upvotes: 2