Reputation: 960
I am having a hard time creating a regex (in Python 3.6) with which I could parse a datetime string, following these rules:
YYYYMMDD, where YYYY is any year between 2000 and 2099 (inclusive), so that becomes 20yyMMDD
HHMMSS
YYYYMMDDHHMMSS
YYYYMMDDHHMMSS - Ok
YYYYMMDD-HHMMSS - Ok
YYYYMMDD HHMMSS - Ok
YYYYMMDD1HHMMSS - Not accepted
(YYYYMMDDHHMMSS) - Ok
123-YYYYMMDDHHMMSS)123 - Ok
abc1YYYYMMDDHHMMSS - Not accepted
I know the basics of regex and have read many SO answers (I found "Regex: match everything but" and "Regex, every non-alphanumeric character except white space or colon", among others, pretty useful), but I just cannot figure out a regex that passes all of my test cases.
I need two groups for the actual date and time parsing, that is (20[\d]{6})([\d]{6}). Then I added support for the additional characters, .*(20[\d]{6})[^\d]?([\d]{6}).*, which works fine until there is a numeric character in front, at the end or in the middle: such strings should not match, but they do. So I started adding different things at the front or the back, for example (?<![\d]), .*[^\d]?, [^\d]?.*, ... but unfortunately my regex knowledge soon runs out, and the pattern becomes a mess that I do not understand and that does not work properly.
I made some test strings (each with the desired results) and a simple test function:
import datetime
import re
from typing import List, Optional, Tuple
#my_regex = r"(?<![\d])(20[\d]{6})[^\d]?([\d]{6})[^\d]?.*"
my_regex = r"\b(20[\d]{6})[^\d]?([\d]{6})[^\d]?.*"
dt = datetime.datetime(2017, 12, 17, 9, 10, 11)
tests: List[Tuple[str, Optional[datetime.datetime]]] = [
# Clean one.
("20171217091011", dt),
# Character in between.
("20171217a091011", dt),
("20171217b091011", dt),
("20171217-091011", dt),
("20171217_091011", dt),
("20171217 091011", dt),
("201712170091011", None), # Before/in between/at the end in this case.
# Characters in front.
("a20171217091011", dt),
("b20171217091011", dt),
(" 20171217091011", dt),
("-20171217091011", dt),
("_20171217091011", dt),
("020171217091011", None),
("aa20171217091011", dt),
("a1-20171217091011", dt),
("123_20171217091011", dt),
("123 20171217091011", dt),
("123=20171217091011", dt),
("201720171217091011", None),
# Characters at the end.
("20171217091011a", dt),
("20171217091011b", dt),
("20171217091011 ", dt),
("20171217091011-", dt),
("20171217091011_", dt),
("201712170910110", None),
("20171217091011aa", dt),
("20171217091011a1", dt),
("20171217091011-a1", dt),
("20171217091011-123", dt),
("20171217091011_123", dt),
("20171217091011 123", dt),
("20171217091011?123", dt),
# Characters at both ends.
("a20171217091011a", dt),
("(20171217091011)", dt),
("a-20171217091011 b", dt),
("123(20171217091011)456", dt),
(" 20171217091011 ", dt),
("2017 20171217091011 2017", dt),
("20171218-20171217091011-070809", dt),
# Characters at both ends and in the middle.
("123(20171217-091011)456", dt),
("a2017(20171217 091011)b", dt),
("2017xx(20171217?091011)cc2017", dt),
("2017xx(201712170091011)cc2017", None),
("2017xx(201712170091011", None),
# Other cases.
("20171217091011 20171116080910", dt), # Match first.
("A-20171116-080910-20171217091011", datetime.datetime(2017, 11, 16, 8, 9, 10)), # Match first.
]
for test_str, test_time in tests:
    match = re.match(my_regex, test_str)
    time = None
    if match:
        try:
            time = datetime.datetime.strptime("".join(match.groups()), "%Y%m%d%H%M%S")
        except ValueError:
            pass
    if time != test_time:
        print("{: <32s} = {} instead of {}".format(test_str, time, test_time))
But I just cannot get all of the test strings to pass, for example:
a20171217091011 = None instead of 2017-12-17 09:10:11
b20171217091011 = None instead of 2017-12-17 09:10:11
20171217091011 = None instead of 2017-12-17 09:10:11
-20171217091011 = None instead of 2017-12-17 09:10:11
_20171217091011 = None instead of 2017-12-17 09:10:11
aa20171217091011 = None instead of 2017-12-17 09:10:11
a1-20171217091011 = None instead of 2017-12-17 09:10:11
123_20171217091011 = None instead of 2017-12-17 09:10:11
123 20171217091011 = None instead of 2017-12-17 09:10:11
123=20171217091011 = None instead of 2017-12-17 09:10:11
201712170910110 = 2017-12-17 09:10:11 instead of None
a20171217091011a = None instead of 2017-12-17 09:10:11
(20171217091011) = None instead of 2017-12-17 09:10:11
a-20171217091011 b = None instead of 2017-12-17 09:10:11
123(20171217091011)456 = None instead of 2017-12-17 09:10:11
20171217091011 = None instead of 2017-12-17 09:10:11
2017 20171217091011 2017 = None instead of 2017-12-17 09:10:11
20171218-20171217091011-070809 = 2017-12-18 20:17:12 instead of 2017-12-17 09:10:11
123(20171217-091011)456 = None instead of 2017-12-17 09:10:11
a2017(20171217 091011)b = None instead of 2017-12-17 09:10:11
2017xx(20171217?091011)cc2017 = None instead of 2017-12-17 09:10:11
A-20171116-080910-20171217091011 = None instead of 2017-11-16 08:09:10
Thank you for any ideas.
Upvotes: 4
Views: 2798
Reputation: 627100
It seems that you need to check the general pattern with the regex while validating actual date time values with the appropriate Python methods.
So, you may fix the code using the following regex:
r'(?<!\d)20\d{6}\D?\d{6}(?!\d)'
See the regex demo
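As a minimal sketch of how this could be wired into the test loop from the question: the version below adds two capture groups to the pattern (an assumption made here so that the date and time parts can be joined for strptime) and uses re.search instead of re.match, since re.match only tries the start of the string. The helper name parse_first_datetime is hypothetical.

```python
import datetime
import re
from typing import Optional

# Same pattern as above, with two capture groups added (an assumption,
# so the date and time parts can be joined and handed to strptime).
DT_RE = re.compile(r"(?<!\d)(20\d{6})\D?(\d{6})(?!\d)")


def parse_first_datetime(text: str) -> Optional[datetime.datetime]:
    """Return the first timestamp found in text, or None (hypothetical helper)."""
    # re.search scans the whole string; re.match would only try position 0.
    match = DT_RE.search(text)
    if match is None:
        return None
    try:
        # The regex only checks the shape; strptime validates the values
        # and raises ValueError for things like month 13.
        return datetime.datetime.strptime("".join(match.groups()), "%Y%m%d%H%M%S")
    except ValueError:
        return None


for sample in ("a1-20171217091011", "123(20171217-091011)456",
               "201712170091011", "20171317091011"):
    print(repr(sample), "->", parse_first_datetime(sample))
```

Note that this sketch only looks at the first regex match: if that match has the right shape but is not a valid date, it returns None rather than trying later candidates in the string.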
Details
(?<!\d) - a negative lookbehind that fails the match if there is a digit immediately to the left of the current position
20 - a 20 substring
\d{6} - any 6 digits
\D? - 1 or 0 non-digit chars
\d{6} - any 6 digits
(?!\d) - a negative lookahead that fails the match if there is a digit immediately to the right of the current position.
Upvotes: 2