Reputation: 51
I'm creating a regex to process the line below as read from a file.
30/05/2014 17:58:19 418087******2093 No415000345536 5,000.00
I have successfully created the regex, but my issue is that the string may sometimes appear as below with a slight addition (the 13-digit number 0027200004750):
31/05/2014 15:06:29 410741******7993 0027200004750 No415100345732 1,500.00
Please assist in altering the pattern to ignore the integer of 13 digits that I don't need.
Below is my regex pattern
((?:(?:[0-2]?\d{1})|(?:[3][01]{1}))[-:\/.](?:[0]?[1-9]|[1][012])[-:\/.](?:(?:[1]{1}\d{1}\d{1}\d{1})|(?:[2]{1}\d{3})))(?![\d])(\s+)((?:(?:[0-1][0-9])|(?:[2][0-3])|(?:[0-9])):(?:[0-5][0-9])(?::[0-5][0-9])?(?:\s?(?:am|AM|pm|PM))?)(\s+)(\d{6})(\*{6})(\d{4})(\s+)(No)(\d+)(\s+)([+-]?\d*\.\d+)(?![-+0-9\.])
Advice and contribution will be highly appreciated.
Upvotes: 0
Views: 346
Reputation: 49097
The regular expression in question was most likely created using a regular expression builder.
Here is your regular expression reduced to its component parts, simplified and with support for both variants of valid strings.
Date with incomplete validation (invalid days of a month are still possible):
(?:0?[1-9]|[12]\d|3[01])[-:\/.](?:0?[1-9]|1[012])[-:\/.](?:19|20)\d\d
Whitespace(s) between date and time:
[\t ]+
\s also matches newline characters and other rarely used whitespace characters, which is why I'm using [\t ]+ instead of \s.
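To illustrate the difference, here is a minimal Python sketch. The sample text with a line break between date and time is made up for this demonstration, and the variable names are only illustrative.

```python
import re

# Hypothetical input: date and time separated by a newline instead of a space
text = "30/05/2014\n17:58:19"

# \s+ also matches the newline, so this pattern matches across the line break
crosses_newline = bool(re.search(r"\d{4}\s+\d{2}:", text))

# [\t ]+ accepts only tabs and spaces, so the line break is not crossed
stays_on_line = bool(re.search(r"\d{4}[\t ]+\d{2}:", text))

print(crosses_newline)  # True
print(stays_on_line)    # False
```

When the file is processed line by line, both variants behave the same; the stricter class only matters if several lines are searched at once.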
Time with at least hour and minute, with incomplete validation (a leap second is not supported; AM or PM is accepted even with an invalid hour such as 17):
(?:[01]?\d|2[0-3]):[0-5][0-9](?::[0-5][0-9])?(?:[\t ]?(?:am|AM|pm|PM))?
Whitespace(s), number with 6 digits, 6 asterisks, number with 4 digits, whitespace(s):
[\t ]+\d{6}\*{6}\d{4}[\t ]+
Optionally a number with 13 digits followed by whitespace(s), in a group not marked for backreferencing:
(?:\d{13}[\t ]+)?
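A small Python sketch of how this optional group behaves, using only the middle part of the two sample lines. The fragment pattern and the names short_line and long_line are assumptions for the demonstration, not part of the final expression.

```python
import re

# Shortened fragment: 6 digits, 6 asterisks, 4 digits, then the optional
# 13-digit number, then the "No..." value captured as group 1
fragment = re.compile(r"\d{6}\*{6}\d{4}[\t ]+(?:\d{13}[\t ]+)?(No\d+)")

short_line = "418087******2093 No415000345536"
long_line = "410741******7993 0027200004750 No415100345732"

short_ref = fragment.search(short_line).group(1)
long_ref = fragment.search(long_line).group(1)

print(short_ref)  # No415000345536
print(long_ref)   # No415100345732
```

Because the group is optional and non-capturing, the 13-digit number is consumed when present but never shows up in the captured results.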
The literal No followed by a number with an undetermined number of digits, whitespace(s), optional plus or minus sign, number made of digits, commas and periods (without exponent, format not validated):
No\d+[\t ]+[+-]?[\d,.]+
And here is the entire expression with two additional pairs of parentheses marking the strings of real interest for further processing.
((?:0?[1-9]|[12]\d|3[01])[-:\/.](?:0?[1-9]|1[012])[-:\/.](?:19|20)\d\d[\t ]+(?:[01]?\d|2[0-3]):[0-5][0-9](?::[0-5][0-9])?(?:[\t ]?(?:am|AM|pm|PM))?[\t ]+\d{6}\*{6}\d{4}[\t ]+)(?:\d{13}[\t ]+)?(No\d+[\t ]+[+-]?[\d,.]+)
The first marking group matches:
30/05/2014 17:58:19 418087******2093
31/05/2014 15:06:29 410741******7993
\1 or $1 can be used to reference this part of the entire found string.
The second marking group matches:
No415000345536 5,000.00
No415100345732 1,500.00
\2 or $2 can be used to reference this part of the entire found string.
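As a sketch of how both groups could be used, here is the entire expression applied to the two sample lines in Python. The question does not say which language or tool is used, so Python is an assumption here; its re module uses \1 and \2 in replacement strings (not $1 and $2).

```python
import re

# The entire expression from above, split over several raw strings for readability
PATTERN = re.compile(
    r"((?:0?[1-9]|[12]\d|3[01])[-:\/.](?:0?[1-9]|1[012])[-:\/.](?:19|20)\d\d"
    r"[\t ]+(?:[01]?\d|2[0-3]):[0-5][0-9](?::[0-5][0-9])?(?:[\t ]?(?:am|AM|pm|PM))?"
    r"[\t ]+\d{6}\*{6}\d{4}[\t ]+)"
    r"(?:\d{13}[\t ]+)?"
    r"(No\d+[\t ]+[+-]?[\d,.]+)"
)

lines = [
    "30/05/2014 17:58:19 418087******2093 No415000345536 5,000.00",
    "31/05/2014 15:06:29 410741******7993 0027200004750 No415100345732 1,500.00",
]

# Replacing each match with \1\2 drops the optional 13-digit number.
# Group 1 ends with the whitespace before the optional number, so the
# two parts stay separated after joining.
normalized = [PATTERN.sub(r"\1\2", line) for line in lines]

for line in normalized:
    print(line)
```

After the replacement both lines have the same layout, so the rest of the processing does not need to care about the extra field.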
Hint: (...) is a marking group. (?:...) is a non-marking group because of ?: immediately after the opening parenthesis.
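The difference can be checked quickly in Python, where marking groups are called capturing groups. This is only a sketch with two shortened, made-up patterns; the names capturing and non_capturing are illustrative.

```python
import re

# Same text matched by both patterns; only the first group differs
capturing = re.compile(r"(\d{13})[\t ]+(No\d+)")
non_capturing = re.compile(r"(?:\d{13})[\t ]+(No\d+)")

sample = "0027200004750 No415100345732"

# .groups is the number of marking (capturing) groups in the pattern
print(capturing.groups)      # 2
print(non_capturing.groups)  # 1

# Only marking groups appear in the result tuple
print(capturing.search(sample).groups())      # ('0027200004750', 'No415100345732')
print(non_capturing.search(sample).groups())  # ('No415100345732',)
```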
Upvotes: 2