kevintich

Reputation: 51

Ignore integer within regex match

I'm creating a regex to process the line below as read from a file.

30/05/2014 17:58:19 418087******2093 No415000345536 5,000.00

I have successfully created the regex, but my issue is that the string sometimes appears as below, with an additional 13-digit number (0027200004750) before the No field:

31/05/2014 15:06:29 410741******7993 0027200004750 No415100345732 1,500.00

Please assist in altering the pattern to ignore the integer of 13 digits that I don't need.

Below is my regex pattern

((?:(?:[0-2]?\d{1})|(?:[3][01]{1}))[-:\/.](?:[0]?[1-9]|[1][012])[-:\/.](?:(?:[1]{1}\d{1}\d{1}\d{1})|(?:[2]{1}\d{3})))(?![\d])(\s+)((?:(?:[0-1][0-9])|(?:[2][0-3])|(?:[0-9])):(?:[0-5][0-9])(?::[0-5][0-9])?(?:\s?(?:am|AM|pm|PM))?)(\s+)(\d{6})(\*{6})(\d{4})(\s+)(No)(\d+)(\s+)([+-]?\d*\.\d+)(?![-+0-9\.])

Advice and contribution will be highly appreciated.

Upvotes: 0

Views: 346

Answers (1)

Mofi

Reputation: 49097

The regular expression in question was most likely created using a regular expression builder.

Here is your regular expression reduced to its component parts, simplified and with support for both variants of valid strings.

  1. Date with incomplete validation (an invalid day for a given month is still possible):

    (?:0?[1-9]|[12]\d|3[01])[-:\/.](?:0?[1-9]|1[012])[-:\/.](?:19|20)\d\d
    
  2. Whitespace(s) between date and time:

    [\t ]+
    

    \s also matches newline characters and other rarely used whitespace characters, which is why I'm using [\t ]+ instead of \s.

  3. Time with at least hour and minute, with incomplete validation (a leap second is not matched, and AM or PM is accepted even with an invalid hour):

    (?:[01]?\d|2[0-3]):[0-5][0-9](?::[0-5][0-9])?(?:[\t ]?(?:am|AM|pm|PM))?
    
  4. Whitespace(s), number with 6 digits, 6 asterisks, number with 4 digits, whitespace(s):

    [\t ]+\d{6}\*{6}\d{4}[\t ]+
    
  5. Optionally a number with 13 digits followed by whitespace(s), in a non-marking group so that it is not available for backreferencing:

    (?:\d{13}[\t ]+)?
    
  6. No followed by an undetermined number of digits, whitespace(s), optional plus or minus sign, and a number made up of digits, commas and periods (a floating point number with thousands separators, without exponent):

    No\d+[\t ]+[+-]?[\d,.]+
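
As an illustration of the note under item 2, here is a minimal Python sketch of how \s+ and [\t ]+ differ when the input contains a line break. The shortened date and time patterns and the names `loose` and `strict` are assumptions made for this example only; they are not part of the expression above.

```python
import re

# A date and a time separated by a newline instead of a space or tab.
text = "30/05/2014\n17:58:19"

# Simplified date/time patterns, used only to isolate the whitespace part.
loose = re.compile(r"\d{2}/\d{2}/\d{4}\s+\d{2}:\d{2}:\d{2}")      # \s+ also crosses line breaks
strict = re.compile(r"\d{2}/\d{2}/\d{4}[\t ]+\d{2}:\d{2}:\d{2}")  # only tabs and spaces

print(loose.search(text) is not None)   # True: the match spans two lines
print(strict.search(text) is not None)  # False: a newline is not accepted
```

With [\t ]+ a record can never be stitched together from the end of one line and the start of the next, which matters when the whole file is searched at once.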
    

And here is the entire expression, with two added pairs of parentheses marking the strings of real interest for further processing.

((?:0?[1-9]|[12]\d|3[01])[-:\/.](?:0?[1-9]|1[012])[-:\/.](?:19|20)\d\d[\t ]+(?:[01]?\d|2[0-3]):[0-5][0-9](?::[0-5][0-9])?(?:[\t ]?(?:am|AM|pm|PM))?[\t ]+\d{6}\*{6}\d{4}[\t ]+)(?:\d{13}[\t ]+)?(No\d+[\t ]+[+-]?[\d,.]+)

The first marking group matches:

30/05/2014 17:58:19 418087******2093 
31/05/2014 15:06:29 410741******7993 

\1 or $1 can be used to reference this part of the entire matched string.

The second marking group matches:

No415000345536 5,000.00
No415100345732 1,500.00

\2 or $2 can be used to reference this part of the entire matched string.

Hint: (...) is a marking (capturing) group. (?:...) is a non-marking (non-capturing) group because of the ?: immediately after the opening parenthesis.
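
The question does not say which language or tool is used, so the following is only a sketch, assuming Python's `re` module, of how the entire expression could be applied to both sample lines. The names `PATTERN`, `SAMPLES` and `parse_line` are hypothetical and chosen for this example.

```python
import re

# The entire expression from above, split into pieces only for readability.
PATTERN = re.compile(
    r"((?:0?[1-9]|[12]\d|3[01])[-:\/.](?:0?[1-9]|1[012])[-:\/.](?:19|20)\d\d"  # date
    r"[\t ]+(?:[01]?\d|2[0-3]):[0-5][0-9](?::[0-5][0-9])?"  # whitespace and time
    r"(?:[\t ]?(?:am|AM|pm|PM))?"                           # optional AM/PM
    r"[\t ]+\d{6}\*{6}\d{4}[\t ]+)"                         # masked number, end of group 1
    r"(?:\d{13}[\t ]+)?"                                    # optional 13 digits, not captured
    r"(No\d+[\t ]+[+-]?[\d,.]+)"                            # group 2: No field and amount
)

SAMPLES = [
    "30/05/2014 17:58:19 418087******2093 No415000345536 5,000.00",
    "31/05/2014 15:06:29 410741******7993 0027200004750 No415100345732 1,500.00",
]

def parse_line(line):
    """Return (group 1, group 2) for a matching line, or None if nothing matches."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return match.group(1), match.group(2)

for sample in SAMPLES:
    print(parse_line(sample))
```

Group 1 keeps its trailing space because the whitespace before the optional 13-digit number sits inside the first pair of parentheses; call `.rstrip()` on it if that space is not wanted.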

Upvotes: 2
