Bojan P.

Reputation: 960

Regex: match any character zero or more times, except number "touching" the matching group

I am having a hard time creating a regex (in Python 3.6) with which I could parse a datetime string, following these rules:

I know the basics of regex and have read many SO answers (I found "Regex: match everything but" and "Regex, every non-alphanumeric character except white space or colon", among others, pretty useful), but I just cannot figure out a regex that passes all of my test cases.

I need two groups for the actual date and time parsing, that is, (20[\d]{6})([\d]{6}). Then I added support for the additional characters, .*(20[\d]{6})[^\d]?([\d]{6}).*, which works fine until there is a numeric character in front, at the end, or in the middle: in those cases it should not match, but it does. So I started adding different things at the front or the back, for example (?<![\d]), .*[^\d]?, [^\d]?.*, ... but unfortunately my regex knowledge runs out quickly, and the pattern becomes a mess that I do not understand and that does not work properly.

I made some test strings (each with the desired results) and a simple test function:

import datetime
import re
from typing import List, Optional, Tuple

#my_regex = r"(?<![\d])(20[\d]{6})[^\d]?([\d]{6})[^\d]?.*"
my_regex = r"\b(20[\d]{6})[^\d]?([\d]{6})[^\d]?.*"

dt = datetime.datetime(2017, 12, 17, 9, 10, 11)

# The expected value is None for strings that must not match.
tests: List[Tuple[str, Optional[datetime.datetime]]] = [
    # Clean one.
    ("20171217091011", dt),
    # Character in between.
    ("20171217a091011", dt),
    ("20171217b091011", dt),
    ("20171217-091011", dt),
    ("20171217_091011", dt),
    ("20171217 091011", dt),
    ("201712170091011", None),  # Before/in between/at the end in this case.
    # Characters in front.
    ("a20171217091011", dt),
    ("b20171217091011", dt),
    (" 20171217091011", dt),
    ("-20171217091011", dt),
    ("_20171217091011", dt),
    ("020171217091011", None),
    ("aa20171217091011", dt),
    ("a1-20171217091011", dt),
    ("123_20171217091011", dt),
    ("123 20171217091011", dt),
    ("123=20171217091011", dt),
    ("201720171217091011", None),
    # Characters at the end.
    ("20171217091011a", dt),
    ("20171217091011b", dt),
    ("20171217091011 ", dt),
    ("20171217091011-", dt),
    ("20171217091011_", dt),
    ("201712170910110", None),
    ("20171217091011aa", dt),
    ("20171217091011a1", dt),
    ("20171217091011-a1", dt),
    ("20171217091011-123", dt),
    ("20171217091011_123", dt),
    ("20171217091011 123", dt),
    ("20171217091011?123", dt),
    # Characters at both ends.
    ("a20171217091011a", dt),
    ("(20171217091011)", dt),
    ("a-20171217091011 b", dt),
    ("123(20171217091011)456", dt),
    (" 20171217091011 ", dt),
    ("2017 20171217091011 2017", dt),
    ("20171218-20171217091011-070809", dt),
    # Characters at both ends and in the middle.
    ("123(20171217-091011)456", dt),
    ("a2017(20171217 091011)b", dt),
    ("2017xx(20171217?091011)cc2017", dt),
    ("2017xx(201712170091011)cc2017", None),
    ("2017xx(201712170091011", None),
    # Other cases.
    ("20171217091011 20171116080910", dt),  # Match first.
    ("A-20171116-080910-20171217091011", datetime.datetime(2017, 11, 16, 8, 9, 10)),  # Match first.
]

for test_str, test_time in tests:
    match = re.match(my_regex, test_str)
    time = None
    if match:
        try:
            time = datetime.datetime.strptime("".join(match.groups()), "%Y%m%d%H%M%S")
        except ValueError:
            pass
    if time != test_time:
        print("{: <32s} = {} instead of {}".format(test_str, time, test_time))

But I just cannot get all of the test strings to pass. For example:

a20171217091011                  = None instead of 2017-12-17 09:10:11
b20171217091011                  = None instead of 2017-12-17 09:10:11
 20171217091011                  = None instead of 2017-12-17 09:10:11
-20171217091011                  = None instead of 2017-12-17 09:10:11
_20171217091011                  = None instead of 2017-12-17 09:10:11
aa20171217091011                 = None instead of 2017-12-17 09:10:11
a1-20171217091011                = None instead of 2017-12-17 09:10:11
123_20171217091011               = None instead of 2017-12-17 09:10:11
123 20171217091011               = None instead of 2017-12-17 09:10:11
123=20171217091011               = None instead of 2017-12-17 09:10:11
201712170910110                  = 2017-12-17 09:10:11 instead of None
a20171217091011a                 = None instead of 2017-12-17 09:10:11
(20171217091011)                 = None instead of 2017-12-17 09:10:11
a-20171217091011 b               = None instead of 2017-12-17 09:10:11
123(20171217091011)456           = None instead of 2017-12-17 09:10:11
 20171217091011                  = None instead of 2017-12-17 09:10:11
2017 20171217091011 2017         = None instead of 2017-12-17 09:10:11
20171218-20171217091011-070809   = 2017-12-18 20:17:12 instead of 2017-12-17 09:10:11
123(20171217-091011)456          = None instead of 2017-12-17 09:10:11
a2017(20171217 091011)b          = None instead of 2017-12-17 09:10:11
2017xx(20171217?091011)cc2017    = None instead of 2017-12-17 09:10:11
A-20171116-080910-20171217091011 = None instead of 2017-11-16 08:09:10

Thank you for any ideas.

Upvotes: 4

Views: 2798

Answers (1)

Wiktor Stribiżew

Reputation: 627100

It seems that you need to check the general pattern with the regex while validating actual date time values with the appropriate Python methods.
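To illustrate the validation half of that split, here is a minimal sketch of letting datetime.strptime reject strings that have the right shape but name an impossible date. The helper name is_real_datetime is hypothetical, and the sketch assumes its input is already exactly 14 digits (which the regex is meant to guarantee):

```python
import datetime


def is_real_datetime(digits: str) -> bool:
    """Return True if a 14-digit YYYYMMDDHHMMSS string is a real date and time.

    Hypothetical helper for illustration; assumes `digits` is exactly 14 digits.
    """
    try:
        datetime.datetime.strptime(digits, "%Y%m%d%H%M%S")
    except ValueError:
        # strptime raises ValueError for values such as month 13 or February 30.
        return False
    return True


print(is_real_datetime("20171217091011"))  # True
print(is_real_datetime("20171317091011"))  # False (month 13)
print(is_real_datetime("20170230091011"))  # False (February 30)
```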

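So, you may fix the code using the following regex. Since it is not anchored to the start of the string and has no capturing groups, apply it with re.search rather than re.match, and wrap the date and time parts in groups if you keep "".join(match.groups()):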

r'(?<!\d)20\d{6}\D?\d{6}(?!\d)'

See the regex demo

Details

  • (?<!\d) - a negative lookbehind that fails the match if there is a digit immediately to the left of the current position
  • 20 - a 20 substring
  • \d{6} - any 6 digits
  • \D? - 1 or 0 non-digit chars
  • \d{6} - any 6 digits
  • (?!\d) - a negative lookahead that fails the match if there is a digit immediately to the right of the current position.
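Putting the pieces together, a possible sketch of how the pattern could be plugged into the question's test loop. Two things here are assumptions rather than part of the pattern shown above: the capture groups around the date and time parts, and the function name parse_timestamp, which is hypothetical.

```python
import datetime
import re
from typing import Optional

# The pattern from this answer, with two capture groups added (an assumption)
# so that the date and time parts can be joined for strptime.
PATTERN = re.compile(r"(?<!\d)(20\d{6})\D?(\d{6})(?!\d)")


def parse_timestamp(text: str) -> Optional[datetime.datetime]:
    """Return the first timestamp found in `text`, or None (hypothetical helper)."""
    # re.search scans the whole string; re.match only tries position 0.
    match = PATTERN.search(text)
    if match is None:
        return None
    try:
        return datetime.datetime.strptime("".join(match.groups()), "%Y%m%d%H%M%S")
    except ValueError:
        # Right shape, but not a real date or time (for example month 13).
        return None


print(parse_timestamp("123(20171217-091011)456"))  # 2017-12-17 09:10:11
print(parse_timestamp("201712170091011"))          # None
```

One limitation of this sketch: if the first pattern match is not a valid date, it returns None instead of trying later candidates in the string. Iterating over PATTERN.finditer(text) would be one way to change that, if it matters for your data.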

Upvotes: 2
