Tom

Reputation: 1

Complex Regex finding date and time

Can someone help me with the following?

I'm trying to find specific date and time strings in a text (to be used within VBA Word). Currently working with the following RegEx string:

(?:([0-9]{1,2})[ |-])?(?:(jan(?:uari)?|feb(?:ruari)?|m(?:aa)?rt|apr(?:il)?|mei|jun(?:i)?|jul(?:i)?|aug(?:ustus)?|sep(?:tember|t)?|okt(?:ober)?|nov(?:ember)?|dec(?:ember)?))?(?: |-)?(?(3)(?: around | at | ))?(?:([0-9]{1,2}:[0-9]{1,2})?(?: uur| u|u)?)?

Tested output on following text:

  1. date with around time: 26 sep 2016 around 09:00u
  2. date with at time: 1 sep 2016 at 09:00 uur
  3. date and time u: 1 sep 2018 09:00 u
  4. time without date: 08:30 uur
  5. date with time u: 1 sep 2016 at 09:00u
  6. only time: 09:00
  7. only month: jan
  8. month and year: feb 2019
  9. only day: 02
  10. only day with '-': 2-
  11. day and month: 2 jan
  12. month year: jan 2018
  13. date with '-': 2-feb-2018 09:00
  14. other month: 01 sept 2016
  15. full month: 1 september 2018
  16. shortened year: jul '18

Rules:

Example at: https://regex101.com/r/6CFgBP/1/

Expected output (when used in VBA Word): a regex Matches collection in which each Match.SubMatches contains the individual items d, m, y and hh:mm from the capture groups of the search pattern. For example 1, the SubMatches (capture groups) contain the values '26', 'sep', '2016', '09:00'.

The RegEx works fine, but some false-positives need to be excluded:

(I tried some lookaheads and the references \1 and ?(1), but could not get them to work properly...)

Any advice highly appreciated!

Upvotes: 0

Views: 311

Answers (2)

Tom

Reputation: 1

I finally found something that helps me handle the month properly :-)

\b(?:([1-3]|[0-3]\d)[ |-](?'month'(?:[1-9]|\d[12])|(?:jan(?:uari)?|feb(?:ruari)?|m(?:aa)?rt|apr(?:il)?|mei|jun(?:i)?|jul(?:i)?|aug(?:ustus)?|sep(?:tember|t)?|okt(?:ober)?|nov(?:ember)?|dec(?:ember)?))?)?(?:(\g'month')[ |-]((?:19|20|\')(?:\d{2})))?\b(?: omstreeks | om | )?(?:(\d{1,2}[:]\d{2}(?: uur|u)?|[0-2]\d{3}(?: uur|u)))?\b

It uses a named group that is called again as a subroutine. Found here: https://www.regular-expressions.info/subroutine.html

Upvotes: 0

Valdi_Bo

Reputation: 30971

As I understand it, you require that every date/time part (day, month, year, hour and minute) be present.

So you should remove the ? after the relevant groups (they are not optional).

It is also good practice to capture each part in its own capturing group.

There is no need to write something like jun(?:i)?. It is enough (and easier to read) to write just juni? (the ? applies only to the preceding i).

Another hint: since the regex language has the \d character class, use it instead of [0-9] (the regex becomes shorter and easier to read).
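The two shortening hints above can be checked quickly. The sketch below is in Python rather than VBA, and the helper name same_full_matches and the sample strings are made up for illustration. One caveat to keep in mind: on Python str patterns, \d also accepts non-ASCII digits unless re.ASCII is set, so the equivalence with [0-9] shown here holds for ASCII input.

```python
import re

def same_full_matches(pattern_a, pattern_b, samples):
    """Return True if both patterns accept exactly the same samples in full."""
    return all(
        bool(re.fullmatch(pattern_a, s)) == bool(re.fullmatch(pattern_b, s))
        for s in samples
    )

# Hypothetical sample strings, chosen to include near misses.
month_samples = ["jun", "juni", "junii", "ju", "juli"]
digit_samples = ["7", "26", "123", "a1", ""]

# jun(?:i)? versus the shorter juni?
print(same_full_matches(r"jun(?:i)?", r"juni?", month_samples))

# [0-9]{1,2} versus the shorter \d{1,2} (ASCII input only)
print(same_full_matches(r"[0-9]{1,2}", r"\d{1,2}", digit_samples))
```

Both calls print True for these samples.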

Optional parts (at / around) should go in an optional, non-capturing group.

Anything after the minute part is not needed in the regex.

So I propose a regex like below (for readability, I divided it into rows):

(\d{1,2})[ -](jan(?:uari)?|feb(?:ruari)?|m(?:aa)?rt|apr(?:il)?|mei|juni?
|juli?|aug(?:ustus)?|sep(?:tember|t)?|okt(?:ober)?|nov(?:ember)?|dec(?:ember)?)
[ -](\d{4}) (?:around |at )?(\d{1,2}:\d{1,2})

Details:

  • (\d{1,2}) - Day.
  • [ -] - A separator after the day (either a space or a minus).
  • (jan(?:uari)?|...dec(?:ember)?) - Month.
  • [ -] - A separator after the month.
  • (\d{4}) - year.
  • (?:around |at )? - Actually, 3 variants of a separator between year and hour (space / around / at), note the space before (...)?.
  • (\d{1,2}:\d{1,2}) - Hour and minute.

It matches variants 1, 2, 3, 5 and 13. All the others lack at least one required part, so they are not matched.
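As a rough check of the claim above, here is a minimal Python sketch (not VBA) that joins the three rows of the pattern into one string and runs it over the 16 test values from the question. The names MONTH, STRICT and SAMPLES are illustrative, and the re.IGNORECASE flag is an assumption, since the thread does not say how case should be handled.

```python
import re

# Month alternatives copied from the answer's pattern.
MONTH = (
    r"jan(?:uari)?|feb(?:ruari)?|m(?:aa)?rt|apr(?:il)?|mei|juni?|juli?"
    r"|aug(?:ustus)?|sep(?:tember|t)?|okt(?:ober)?|nov(?:ember)?|dec(?:ember)?"
)

# Day, month, year and time are all required here.
STRICT = re.compile(
    r"(\d{1,2})[ -](" + MONTH + r")[ -](\d{4}) (?:around |at )?(\d{1,2}:\d{1,2})",
    re.IGNORECASE,  # assumption: months may appear in any case
)

# Test values from the question; list position + 1 is the variant number.
SAMPLES = [
    "26 sep 2016 around 09:00u", "1 sep 2016 at 09:00 uur", "1 sep 2018 09:00 u",
    "08:30 uur", "1 sep 2016 at 09:00u", "09:00", "jan", "feb 2019", "02", "2-",
    "2 jan", "jan 2018", "2-feb-2018 09:00", "01 sept 2016", "1 september 2018",
    "jul '18",
]

matched = [n for n, text in enumerate(SAMPLES, start=1) if STRICT.search(text)]
print(matched)  # [1, 2, 3, 5, 13]

# The four capture groups correspond to day, month, year and hh:mm.
first = STRICT.search(SAMPLES[0])
print(first.groups())  # ('26', 'sep', '2016', '09:00')
```

The groups() tuple plays the role that Match.SubMatches plays in VBA, although indexing and engine features differ between the two.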

If the hour/minute part may be optional, change the respective fragment to:

( (?:around |at )?(\d{1,2}:\d{1,2}))?

i.e. surround the space/around/at / hour / minute part with ( and )?, making this part an optional group. Then, variants 14 and 15 will also be matched.

One more extension: if you also want to allow the hour/minute part on its own, add |(\d{1,2}:\d{1,2}) to the regex (everything before it is the first variant, and the added part is the second variant, for hour/minute only).

Then, your variants No 4 and 6 will also be matched.
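The two extensions can be tried out together in a short Python sketch (again not VBA). The names MONTH, EXTENDED and SAMPLES are illustrative, and re.IGNORECASE is an assumption. Note that the extra alternative adds a capture group of its own, so the time ends up in group 5 for a full date and in group 6 when it stands alone.

```python
import re

# Month alternatives copied from the answer's pattern.
MONTH = (
    r"jan(?:uari)?|feb(?:ruari)?|m(?:aa)?rt|apr(?:il)?|mei|juni?|juli?"
    r"|aug(?:ustus)?|sep(?:tember|t)?|okt(?:ober)?|nov(?:ember)?|dec(?:ember)?"
)

EXTENDED = re.compile(
    r"(\d{1,2})[ -](" + MONTH + r")[ -](\d{4})"   # groups 1-3: day, month, year
    r"( (?:around |at )?(\d{1,2}:\d{1,2}))?"      # groups 4-5: optional time part
    r"|(\d{1,2}:\d{1,2})",                        # group 6: time on its own
    re.IGNORECASE,  # assumption: months may appear in any case
)

# Test values from the question; list position + 1 is the variant number.
SAMPLES = [
    "26 sep 2016 around 09:00u", "1 sep 2016 at 09:00 uur", "1 sep 2018 09:00 u",
    "08:30 uur", "1 sep 2016 at 09:00u", "09:00", "jan", "feb 2019", "02", "2-",
    "2 jan", "jan 2018", "2-feb-2018 09:00", "01 sept 2016", "1 september 2018",
    "jul '18",
]

matched = [n for n, text in enumerate(SAMPLES, start=1) if EXTENDED.search(text)]
print(matched)  # [1, 2, 3, 4, 5, 6, 13, 14, 15]

def time_of(match):
    """Return hh:mm from whichever alternative matched, or None."""
    return match.group(5) or match.group(6)

print(time_of(EXTENDED.search(SAMPLES[0])))   # 09:00 (from group 5)
print(time_of(EXTENDED.search(SAMPLES[3])))   # 08:30 (from group 6)
print(time_of(EXTENDED.search(SAMPLES[13])))  # None (date without time)
```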

For a working example see https://regex101.com/r/33t1ps/1

Edit

Following your list of rules, I propose the following regex:

  • (\d{1,2}[ -])? - Day + separator, optional.
  • (jan(?:uari)?|...|dec(?:ember)?) - Month.
  • (?:[ -](\d{4}|'\d{2}))? - Separator + year (either 4 or 2 digits with "'").
  • ( (?:around |at )?(\d{1,2}:\d{1,2}))? - Separator + hour/minute - optional end of variant 1.
  • |(\d{1,2}:\d{1,2}) - Variant 2 - only hour and minute.

The only variants it does not match are No 9 and 10.
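The sketch below assembles the listed fragments in order into one pattern and checks which test values fall through. It is written in Python, not VBA, and the concatenation is an assumption based on the bullet list only; the linked regex101 version, which also handles "uur", may differ. MONTH, RULES and SAMPLES are illustrative names, and re.IGNORECASE is likewise an assumption.

```python
import re

# Month alternatives copied from the answer's pattern.
MONTH = (
    r"jan(?:uari)?|feb(?:ruari)?|m(?:aa)?rt|apr(?:il)?|mei|juni?|juli?"
    r"|aug(?:ustus)?|sep(?:tember|t)?|okt(?:ober)?|nov(?:ember)?|dec(?:ember)?"
)

RULES = re.compile(
    r"(\d{1,2}[ -])?"                         # group 1: day + separator, optional
    r"(" + MONTH + r")"                       # group 2: month, required
    r"(?:[ -](\d{4}|'\d{2}))?"                # group 3: year, 4 digits or '2 digits
    r"( (?:around |at )?(\d{1,2}:\d{1,2}))?"  # groups 4-5: optional time part
    r"|(\d{1,2}:\d{1,2})",                    # group 6: time on its own
    re.IGNORECASE,  # assumption: months may appear in any case
)

# Test values from the question; list position + 1 is the variant number.
SAMPLES = [
    "26 sep 2016 around 09:00u", "1 sep 2016 at 09:00 uur", "1 sep 2018 09:00 u",
    "08:30 uur", "1 sep 2016 at 09:00u", "09:00", "jan", "feb 2019", "02", "2-",
    "2 jan", "jan 2018", "2-feb-2018 09:00", "01 sept 2016", "1 september 2018",
    "jul '18",
]

unmatched = [n for n, text in enumerate(SAMPLES, start=1) if not RULES.search(text)]
print(unmatched)  # [9, 10]

# Group 1 captures the separator together with the day, so strip it.
day = RULES.search(SAMPLES[12]).group(1).rstrip(" -")
print(day)  # 2
```

Because the first fragment keeps the separator inside the capturing group, the day comes back as "26 " or "2-" and needs trimming; moving the separator outside the group would avoid that, at the cost of departing from the fragments as listed.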

For the full regex, which also handles "uur", see https://regex101.com/r/33t1ps/3

Upvotes: 0
