Reputation: 808
I am using C#
string content = " 4 marco bob 53 AUSTRIA (Jan. 13, 2012) – McDonald Janruary 15, 2021 July 15, 2923 June 2 2343 7/25/23 08/22/3323";
This should recognize all the dates except "4 marco bob 53", which is obviously not a date. However, my rules (below) match "4 marco bob 53", and I cannot figure out how to avoid matching it (or similar examples).
I am trying to find all the dates in the string above. I wrote 3 rules to match some common date patterns.
e.g.:
Pattern f0: 5/2/2012
Pattern f1: 3 March 1900 or 3 Mar 1990 or 3 MAR. 1990, etc.
Pattern f2: Jan. 4, 2021 or January 4 2021, etc.
string f0 = "([0-9]{1,2})/([0-9]{1,2})/([0-9]{2,4})";
string f1 = "([0-9]{1,2})\\s+([jJ][aA][nN].*?|[fF][eE][bB].*?|[mM][aA][rR].*?|[aA][pP][rR].*?|[mM][aA][yY].*?|[jJ][uU][nN].*?|[jJ][uU][lL].*?|[aA][uU][gG].*?|[sS][eE][pP].*?|[oO][cC][tT].*?|[nN][oO][vV].*?|[dD][eE][cC].*?)\\s+([0-9]{2,4})";
string f2 = "([jJ][aA][nN].*?|[fF][eE][bB].*?|[mM][aA][rR].*?|[aA][pP][rR].*?|[mM][aA][yY].*?|[jJ][uU][nN].*?|[jJ][uU][lL].*?|[aA][uU][gG].*?|[sS][eE][pP].*?|[oO][cC][tT].*?|[nN][oO][vV].*?|[dD][eE][cC].*?)\\s+([0-9]{1,2})[\\s,]+([0-9]{2,4})";
I am new to regex, so I am sure I am doing some silly things (like not using the case-insensitive option), so let me know how I can improve this as well.
This is for learning regex, not for learning how to use library functions.
Upvotes: 2
Views: 4346
Reputation: 7361
Addressing the named-month patterns only: this combines patterns 2 and 3. It would require one more step to fix the last match here (89 February 12, 2099, where the leading 89 is gathered along with the real day, 12), but the combined pattern could be split back into two pretty easily if you wish to do it that way:
string input = " 4 marco bob 53 AUSTRIA (Jan. 13, 2012) – McDonald January 15, 2021 July 15, 2923 June 2 2343 7/25/23 08/22/3323 7 jul 2098 0 Jan 0 fake stuff 89 February 12, 2099 it is a greedy";
var pattern =
@"(\d\d?\s)? (?# greedily gather preceding dd)
(jan(uary)?|feb(ruary)?|mar(ch)?|apr(il)?|may|june?|july?|aug(ust)?|sep(tember)?|oct(ober)?|nov(ember)?|dec(ember)?)
\.?\s?
(\d\d?\b,?\s*)? (?# optional day part)
\d\d(\d\d)?";
var matches = Regex.Matches(input, pattern, RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace);
string result = string.Empty;
for (int i = 0; i < matches.Count; i++)
{
result += "match " + i + ",value:" + matches[i].Value + "\n";
}
Console.WriteLine(result);
Edit: the non-backtracking group was not necessary (a remnant of a more complicated look-ahead approach), so I removed that part.
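One possible way to do that extra step is sketched below. It is my own assumption, not part of the pattern above: the two layouts become separate alternatives, the day is restricted to 1-31, and the whole match is wrapped in \b word boundaries so a stray number such as 89 cannot be absorbed as a day. The names NamedMonthDates and FindDates are hypothetical.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NamedMonthDates
{
    // Month name or three-letter abbreviation, with an optional trailing period.
    private const string Month =
        @"(?:jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|june?|july?|aug(?:ust)?|sep(?:tember)?|oct(?:ober)?|nov(?:ember)?|dec(?:ember)?)\.?";

    // Assumption: only 1-31 (with an optional leading zero) counts as a day.
    private const string Day = @"(?:0?[1-9]|[12]\d|3[01])";

    // Two- or four-digit year, as in the pattern above.
    private const string Year = @"\d{2}(?:\d{2})?";

    // Alternative 1: "7 jul 2098". Alternative 2: "February 12, 2099".
    private static readonly Regex DatePattern = new Regex(
        @"\b(?:" + Day + @"\s+" + Month + @"\s+" + Year +
        "|" + Month + @"\s+" + Day + @",?\s+" + Year + @")\b",
        RegexOptions.IgnoreCase);

    // Returns every substring of input that matches one of the two layouts.
    public static List<string> FindDates(string input)
    {
        var found = new List<string>();
        foreach (Match m in DatePattern.Matches(input))
        {
            found.Add(m.Value);
        }
        return found;
    }
}
```

With this sketch, "89 February 12, 2099" yields only "February 12, 2099", because 89 is not a valid day and the leading \b stops the engine from starting in the middle of the number.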
Upvotes: 2
Reputation: 808
Aggregated some of the answers posted to do what I wanted. This seems to be finding dates in free text reasonably well. Thanks to all posters.
string f0 = "(?:(\\d{1,2})/(\\d{1,2})/(\\d{2,4}))";
string f1 = "(?:(\\s\\d{1,2})\\s+(jan(?:uary){0,1}\\.{0,1}|feb(?:ruary){0,1}\\.{0,1}|mar(?:ch){0,1}\\.{0,1}|apr(?:il){0,1}\\.{0,1}|may\\.{0,1}|jun(?:e){0,1}\\.{0,1}|jul(?:y){0,1}\\.{0,1}|aug(?:ust){0,1}\\.{0,1}|sep(?:tember){0,1}\\.{0,1}|oct(?:ober){0,1}\\.{0,1}|nov(?:ember){0,1}\\.{0,1}|dec(?:ember){0,1}\\.{0,1})\\s+(\\d{2,4}))";
string f2 = "(?:(jan(?:uary){0,1}\\.{0,1}|feb(?:ruary){0,1}\\.{0,1}|mar(?:ch){0,1}\\.{0,1}|apr(?:il){0,1}\\.{0,1}|may\\.{0,1}|jun(?:e){0,1}\\.{0,1}|jul(?:y){0,1}\\.{0,1}|aug(?:ust){0,1}\\.{0,1}|sep(?:tember){0,1}\\.{0,1}|oct(?:ober){0,1}\\.{0,1}|nov(?:ember){0,1}\\.{0,1}|dec(?:ember){0,1}\\.{0,1})\\s+([0-9]{1,2})[\\s,]+(\\d{2,4}))";
MatchCollection mc = Regex.Matches(content, f0 + "|" + f1 + "|" + f2, RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace);
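The month alternation is repeated in f1 and f2, so one option is to write it once and build the patterns from it. The following is a minimal sketch of that idea, not a drop-in replacement: the names CombinedDatePatterns and FindDates are hypothetical, "?" is used in place of "{0,1}", and the capture groups are dropped because only the whole match is returned.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CombinedDatePatterns
{
    // Month names written once; "?" is equivalent to "{0,1}".
    private const string Month =
        @"(?:jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|june?|july?|aug(?:ust)?|sep(?:tember)?|oct(?:ober)?|nov(?:ember)?|dec(?:ember)?)\.?";

    // f0: 5/2/2012
    private const string F0 = @"\d{1,2}/\d{1,2}/\d{2,4}";
    // f1: 3 March 1900 (a whitespace character is required before the day, as in f1 above)
    private const string F1 = @"\s\d{1,2}\s+" + Month + @"\s+\d{2,4}";
    // f2: Jan. 4, 2021
    private const string F2 = Month + @"\s+\d{1,2}[\s,]+\d{2,4}";

    private static readonly Regex Combined = new Regex(
        "(?:" + F0 + ")|(?:" + F1 + ")|(?:" + F2 + ")",
        RegexOptions.IgnoreCase);

    // Returns the matched date strings; Trim removes the leading
    // whitespace that F1 includes in its match.
    public static List<string> FindDates(string content)
    {
        var found = new List<string>();
        foreach (Match m in Combined.Matches(content))
        {
            found.Add(m.Value.Trim());
        }
        return found;
    }
}
```

RegexOptions.IgnorePatternWhitespace is left out here because none of the pattern strings contain whitespace that needs to be ignored.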
Upvotes: 2
Reputation: 6626
Your regex f1 matches the string for the following reasons:
"4 " because of ([0-9]{1,2})\\s+
"mar" because of [mM][aA][rR]
"co bob" because of .*?
" 53" because of \\s+([0-9]{2,4})
Remove the .*? that you have after each month. It means "match any character, as few times as possible". The regex engine therefore keeps extending that part one character at a time until the next condition, in your case \\s+([0-9]{2,4}), can match, so you match "co bob" before " 53".
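To see the difference, here is a minimal sketch comparing a shortened f1 that keeps .*? with a variant that allows only letters and an optional period after the month prefix. Only three months are listed to keep it short, and the names LazyQuantifierDemo, LooseMatch and StrictMatch are hypothetical.

```csharp
using System.Text.RegularExpressions;

public static class LazyQuantifierDemo
{
    // Shortened f1 from the question: ".*?" after the month prefix can grow
    // over any characters until "\s+" followed by digits is able to match.
    private const string Loose =
        @"([0-9]{1,2})\s+(jan.*?|feb.*?|mar.*?)\s+([0-9]{2,4})";

    // Variant: only letters and an optional period may follow the prefix.
    private const string Strict =
        @"([0-9]{1,2})\s+((?:jan|feb|mar)[a-z]*\.?)\s+([0-9]{2,4})";

    // Returns the first match of pattern in input, or "" when there is none.
    private static string FirstMatch(string pattern, string input)
    {
        Match m = Regex.Match(input, pattern, RegexOptions.IgnoreCase);
        return m.Success ? m.Value : string.Empty;
    }

    public static string LooseMatch(string input)
    {
        return FirstMatch(Loose, input);
    }

    public static string StrictMatch(string input)
    {
        return FirstMatch(Strict, input);
    }
}
```

LooseMatch("4 marco bob 53") returns the whole string, while StrictMatch returns an empty string for it and still accepts "3 Mar. 1990".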
Upvotes: 2
Reputation: 8306
You need to specify which language you are doing this in.
Most languages offer a built-in way to parse dates, so writing your own regex for validation is not the answer.
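In C#, for instance, a candidate string can be checked with DateTime.TryParseExact. The sketch below is only an illustration: the list of formats is my own assumption about which layouts matter, and the names LibraryDateParsing and TryParseDate are hypothetical.

```csharp
using System;
using System.Globalization;

public static class LibraryDateParsing
{
    // Formats chosen for illustration; extend the list as needed.
    private static readonly string[] Formats =
    {
        "M/d/yyyy", "MMMM d, yyyy", "MMM d, yyyy", "d MMMM yyyy", "d MMM yyyy"
    };

    // Returns true and sets date when text is exactly one of the formats above.
    public static bool TryParseDate(string text, out DateTime date)
    {
        return DateTime.TryParseExact(
            text.Trim(), Formats, CultureInfo.InvariantCulture,
            DateTimeStyles.None, out date);
    }
}
```

Note that this validates a single candidate string. It does not locate dates inside free text, so it would complement a search step and not replace it.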
Upvotes: 1