James Cotter

Reputation: 808

Regex date time matching

I am using C#

string content = " 4 marco bob 53 AUSTRIA (Jan. 13, 2012) – McDonald Janruary 15, 2021 July 15, 2923   June 2 2343 7/25/23 08/22/3323";

This should recognize all the dates except "4 marco bob 53", which is obviously not a date. However, my rules (below) match "4 marco bob 53", and I cannot figure out how to avoid matching it (or similar examples).

I am trying to find all the dates in the string above. I wrote 3 rules to match some common date patterns.

eg:

Pattern f0: 5/2/2012

Pattern f1: 3 March 1900 or 3 Mar 1990 or 3 MAR. 1990 etc...

Pattern f2: Jan. 4, 2021 or January 4 2021, etc...

 string f0 = "([0-9]{1,2})/([0-9]{1,2})/([0-9]{2,4})";
 string f1 = "([0-9]{1,2})\\s+([jJ][aA][nN].*?|[fF][eE][bB].*?|[mM][aA][rR].*?|[aA][pP][rR].*?|[mM][aA][yY].*?|[jJ][uU][nN].*?|[jJ][uU][lL].*?|[aA][uU][gG].*?|[sS][eE][pP].*?|[oO][cC][tT].*?|[nN][oO][vV].*?|[dD][eE][cC].*?)\\s+([0-9]{2,4})";
 string f2 = "([jJ][aA][nN].*?|[fF][eE][bB].*?|[mM][aA][rR].*?|[aA][pP][rR].*?|[mM][aA][yY].*?|[jJ][uU][nN].*?|[jJ][uU][lL].*?|[aA][uU][gG].*?|[sS][eE][pP].*?|[oO][cC][tT].*?|[nN][oO][vV].*?|[dD][eE][cC].*?)\\s+([0-9]{1,2})[\\s,]+([0-9]{2,4})";

I am new to regex, so I am sure I am doing some silly things (like not using the case-insensitive option), so let me know how I can improve this as well.

This is for learning regex, not learning how to use library functions....

Upvotes: 2

Views: 4346

Answers (4)

Scott Weaver

Reputation: 7361

Addressing the named-month patterns only: this combines patterns 2 and 3. It would need one more step to fix the last match here (89 February 12, 2099, where the leading 89 is picked up as a day), but the two patterns could be separated fairly easily if you prefer to do it that way:

    string input = " 4 marco bob 53 AUSTRIA (Jan. 13, 2012) – McDonald January 15, 2021 July 15, 2923   June 2 2343 7/25/23 08/22/3323 7 jul 2098 0 Jan 0 fake stuff 89 February 12, 2099 it is a greedy";
    var pattern =
    @"(\d\d?\s)? (?# greedily gather preceding dd)
    (jan(uary)?|feb(ruary)?|mar(ch)?|apr(il)?|may|june?|july?|aug(ust)?|sep(tember)?|oct(ober)?|nov(ember)?|dec(ember)?)
    \.?\s?
    (\d\d?\b,?\s*)? (?# optional day part)
    \d\d(\d\d)?";

    var matches = Regex.Matches(input, pattern, RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace);
    string result = string.Empty;
    for (int i = 0; i < matches.Count; i++)
    {
        result += "match " + i + ",value:" + matches[i].Value + "\n";
    }   
    Console.WriteLine(result);

Edit: the non-backtracking group was not necessary (a remnant of a more complicated look-ahead approach), so I removed that part.

Upvotes: 2

James Cotter

Reputation: 808

Aggregated some of the answers posted to do what I wanted. This seems to be finding dates in free text reasonably well. Thanks to all posters.

string f0 = "(?:(\\d{1,2})/(\\d{1,2})/(\\d{2,4}))";
string f1 = "(?:(\\s\\d{1,2})\\s+(jan(?:uary){0,1}\\.{0,1}|feb(?:ruary){0,1}\\.{0,1}|mar(?:ch){0,1}\\.{0,1}|apr(?:il){0,1}\\.{0,1}|may\\.{0,1}|jun(?:e){0,1}\\.{0,1}|jul(?:y){0,1}\\.{0,1}|aug(?:ust){0,1}\\.{0,1}|sep(?:tember){0,1}\\.{0,1}|oct(?:ober){0,1}\\.{0,1}|nov(?:ember){0,1}\\.{0,1}|dec(?:ember){0,1}\\.{0,1})\\s+(\\d{2,4}))";
 string f2 = "(?:(jan(?:uary){0,1}\\.{0,1}|feb(?:ruary){0,1}\\.{0,1}|mar(?:ch){0,1}\\.{0,1}|apr(?:il){0,1}\\.{0,1}|may\\.{0,1}|jun(?:e){0,1}\\.{0,1}|jul(?:y){0,1}\\.{0,1}|aug(?:ust){0,1}\\.{0,1}|sep(?:tember){0,1}\\.{0,1}|oct(?:ober){0,1}\\.{0,1}|nov(?:ember){0,1}\\.{0,1}|dec(?:ember){0,1}\\.{0,1})\\s+([0-9]{1,2})[\\s,]+(\\d{2,4}))";

MatchCollection mc = Regex.Matches(content, f0 + "|" + f1 + "|" + f2, RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace);
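
As a possible tidy-up of the same idea, the month alternation can be written once and reused, with `?` in place of `{0,1}` and verbatim strings to avoid doubled backslashes. This is only a sketch: the names `DateFinder`, `Month` and `Find` are made up for illustration, and using `\b` instead of a leading `\s` in the day-first pattern is an assumption that changes what the match includes (no leading space).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class DateFinder
{
    // Month names written once, each with an optional trailing period.
    const string Month =
        @"(?:jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|june?|july?|aug(?:ust)?" +
        @"|sep(?:tember)?|oct(?:ober)?|nov(?:ember)?|dec(?:ember)?)\.?";

    const string F0 = @"\d{1,2}/\d{1,2}/\d{2,4}";                 // 5/2/2012
    const string F1 = @"\b\d{1,2}\s+" + Month + @"\s+\d{2,4}";     // 3 Mar. 1990
    const string F2 = @"\b" + Month + @"\s+\d{1,2}[\s,]+\d{2,4}";  // Jan. 4, 2021

    static readonly Regex Dates =
        new Regex(F0 + "|" + F1 + "|" + F2, RegexOptions.IgnoreCase);

    // Returns the text of every date-like match, in order of appearance.
    public static List<string> Find(string text)
    {
        var found = new List<string>();
        foreach (Match m in Dates.Matches(text))
            found.Add(m.Value);
        return found;
    }
}

class Program
{
    static void Main()
    {
        string content = " 4 marco bob 53 AUSTRIA (Jan. 13, 2012) – McDonald Janruary 15, 2021 July 15, 2923   June 2 2343 7/25/23 08/22/3323";
        foreach (string date in DateFinder.Find(content))
            Console.WriteLine(date);
    }
}
```

Because the month must now be followed directly by an optional period and whitespace, "4 marco bob 53" is skipped, and so is the misspelled "Janruary 15, 2021" from the original test string.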

Upvotes: 2

Anastasiosyal

Reputation: 6626

Your pattern f1 matches "4 marco bob 53" for the following reasons:

  • 4 because of ([0-9]{1,2})\\s+
  • mar because of [mM][aA][rR]
  • co bob because of .*?
  • 53 because of \\s+([0-9]{2,4})

Remove the .*? that you have after each month. It means "match any characters, non-greedily", so it keeps expanding until your next condition, in your case \\s+([0-9]{2,4}), can match. That is how "co bob" gets consumed before " 53".
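
To see the effect in isolation, here is a small sketch that reduces f1 to its March alternative and compares it with and without the lazy wildcard. The class name `LazyWildcardDemo`, the helper `FirstMatch`, and the extra "3 MAR. 1990" appended to the test text are all made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class LazyWildcardDemo
{
    // f1 reduced to its March alternative, keeping the lazy wildcard from the question.
    public const string WithWildcard = @"([0-9]{1,2})\s+(mar.*?)\s+([0-9]{2,4})";

    // Same shape, but only "mar" or "march" plus an optional period may precede the year.
    public const string WithoutWildcard = @"([0-9]{1,2})\s+(mar(?:ch)?\.?)\s+([0-9]{2,4})";

    // Returns the first match of pattern in text, ignoring case.
    public static string FirstMatch(string text, string pattern)
    {
        Match m = Regex.Match(text, pattern, RegexOptions.IgnoreCase);
        return m.Success ? m.Value : "(no match)";
    }

    static void Main()
    {
        string content = " 4 marco bob 53 AUSTRIA 3 MAR. 1990";
        Console.WriteLine(FirstMatch(content, WithWildcard));     // prints "4 marco bob 53"
        Console.WriteLine(FirstMatch(content, WithoutWildcard));  // prints "3 MAR. 1990"
    }
}
```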

Upvotes: 2

AlanFoster

Reputation: 8306

You need to specify which language you are doing this in.

Most languages offer a built-in way to parse dates, so writing your own regex for validation is generally not the answer.
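
In C#, for instance, a regex can find candidates and `DateTime.TryParseExact` can then check whether each one is a real date. A minimal sketch, assuming only the slash-separated pattern and US-style month/day order; the class name `DateCheck`, the method `IsRealDate`, and the format list are illustrative, not exhaustive:

```csharp
using System;
using System.Globalization;

static class DateCheck
{
    // Formats for the slash pattern only (assumed month/day/year order).
    static readonly string[] SlashFormats = { "M/d/yyyy", "M/d/yy" };

    // True when the candidate parses as a calendar date in one of the formats above.
    public static bool IsRealDate(string candidate)
    {
        return DateTime.TryParseExact(candidate, SlashFormats,
            CultureInfo.InvariantCulture, DateTimeStyles.None, out _);
    }

    static void Main()
    {
        Console.WriteLine(IsRealDate("7/25/23"));     // prints "True"
        Console.WriteLine(IsRealDate("13/45/2012"));  // prints "False"
    }
}
```

The second input has the right shape for a digits/digits/digits regex but fails here because there is no month 13.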

Upvotes: 1
