SavindraSingh

Reputation: 961

Regular Expression - Ignore multiple spaces and Consider only one space in the match

I am stuck on a regular expression and have been unable to fix it after trying several different approaches.

Here is the link to the regex with sample input; the regex and sample text are reproduced below for quick reference:

Regex:

    ((XYZ|xyz|Xyz|Convent){1}(\s{0,})(INTERNATIONAL|International|international|INC|Inc|inc|LLC|Llc|llc|WORLDWIDE PTE\. LTD\.){0,})([\r\n]|[\t]|[ ]{2,})(?<CircuitID>[a-zA-Z0-9\-\/\ ]{6,26})\s*

Sample input:

Circuits Affected:
Company Name    Circuit ID      Z End   Billing ID      ServiceType
XYZ INTERNATIONAL      A1B101012        ABCD-XYZ-ABG089AB       A000123456      UNI
XYZ INTERNATIONAL   AB/PQRS/012345/ /ABC /  PQRSTUVW    ABCDEFGH    CO ADDRESS CITY 1234 S RIDGELINE BLVD   VA SOMECITY 12345 SOME BRANCH PKWY  A0123456    N/A 0:00 CDT - 1:00 CDT 
________________________________________
XYZ INTERNATIONAL   AB/ABYX/271703/ /ABC /  ABCDO47HBS  ABCDEFG71   N/A N/A A0123456    N/A 0:00 CDT - 1:00 CDT 
________________________________________
XYZ WORLDWIDE PTE. LTD. A1234597    N/A N/A N/A N/A A0123456    abcde-randomashb-12234568   0:00 CDT - 1:00 CDT 
________________________________________
Convent Inc A123 4599   N/A N/A N/A N/A A0123456    abcde-randomhigh-00012345   0:00 CDT - 1:00 CDT 
________________________________________
XYZ INTERNATIONAL   B124565 N/A N/A N/A N/A A0123456    abcde-randomashb-12234568   0:00 CDT - 1:00 CDT 
________________________________________
XYZ INC AB/CDEF/123455/ /ABC /  ABCDEFGH    PQRSTUVW    N/A N/A A0123456    N/A 0:00 CDT - 1:00 CDT 
________________________________________

What I am looking for is to capture the values from the Circuit ID column. This works well when a tab character follows the circuit ID. Sometimes, however, multiple spaces follow the circuit ID instead of a tab, and in that case the regex captures the wrong circuit ID by including extra characters from the next column.

To get around this, I need to modify the capture group so that it accepts a space inside the Circuit ID only when it is a single space, never a run of two or more.
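The behaviour described above can be reproduced with a minimal Python sketch. It is an illustration only: the named group is rewritten as `(?P<CircuitID>...)` because Python's `re` module uses that spelling, the name `ORIGINAL` is made up for this example, and the two test lines are built by hand on the assumption that the first uses tabs and the second uses runs of spaces.

```python
import re

# The pattern from the question. Only the named-group spelling is changed,
# from (?<CircuitID>...) to (?P<CircuitID>...), for Python's re module.
ORIGINAL = re.compile(
    r"((XYZ|xyz|Xyz|Convent){1}(\s{0,})"
    r"(INTERNATIONAL|International|international|INC|Inc|inc|LLC|Llc|llc"
    r"|WORLDWIDE PTE\. LTD\.){0,})"
    r"([\r\n]|[\t]|[ ]{2,})"
    r"(?P<CircuitID>[a-zA-Z0-9\-\/\ ]{6,26})\s*"
)

# Assumed inputs: one line with tab separators, one with runs of spaces.
tab_separated = "XYZ INTERNATIONAL\tB124565\tN/A\tN/A"
space_separated = (
    "XYZ INTERNATIONAL" + " " * 6 + "A1B101012" + " " * 8 + "ABCD-XYZ-ABG089AB"
)

for line in (tab_separated, space_separated):
    match = ORIGINAL.search(line)
    print(repr(match.group("CircuitID")))
```

With the tab-separated line the group stops at the tab. With the space-separated line the character class accepts every space, so the greedy `{6,26}` quantifier runs on into the next column until it reaches 26 characters.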

In the screenshot below, the data underlined in red should not be captured. The values underlined in green are OK:

[screenshot of the regex matches]

Upvotes: 5

Views: 1776

Answers (2)

Cary Swoveland

Reputation: 110675

The main task is to establish the rules for identifying the Circuit ID. Once that is done it can be determined if those strings can be extracted with the use of a regular expression.

After having posited rules for identifying the Circuit ID my objective was to construct a regular expression that avoided the need to enumerate company names in the regex, considering that attempts to match those names could be problematic due to misspellings or other variations.

Let's first look at the beginnings of the example strings that contain what I understand to be the circuit ids, which I've indicated by ^^^^^.

XYZ INTERNATIONAL      A1B101012        AB ...<tab>...
                       ^^^^^^^^^
XYZ INTERNATIONAL<tab>AB/PQRS/012345/ /ABC /<tab>PQR...
                      ^^^^^^^^^^^^^^^^^^^^^^
XYZ INTERNATIONAL<tab>AB/ABYX/271703/ /ABC /<tab>ABC...
                      ^^^^^^^^^^^^^^^^^^^^^^
XYZ WORLDWIDE PTE. LTD.<tab>A1234597<tab>N/A...
                            ^^^^^^^^
Convent Inc<tab>A123 4599<tab>N/A...
                ^^^^^^^^^
XYZ INTERNATIONAL<tab>B124565<tab>N/A...
                      ^^^^^^^
XYZ INC<tab>AB/CDEF/123455/ /ABC /<tab>ABC...
            ^^^^^^^^^^^^^^^^^^^^^^

The triple dots are placeholders for additional text.

On the basis of this limited sample I will posit a rule for identifying the circuit id string. If my assumptions are incorrect the OP needs to clarify the rule.

I assume that a circuit id string is:

  • comprised of 2-4 uppercase letters and digits, of which there is at least one letter, followed by 4-8 digits, the string being preceded and followed by a space; or
  • the text between the first two tabs in the string.

A string may have multiple substrings that satisfy one of these two requirements, but it appears from the data that, if there is at least one match, the first match will return the circuit id string. Therefore, all matches after the first are to be ignored.

The rule I have set out may of course have to be modified by the OP, which would require a change to the regular expression I suggest using. That expression is as follows.

(?<= )(?=.{0,3}[A-Z])[A-Z\d]{2,4}\d{4,8}(?= )|(?<=\t).*?(?=\t)

Demo

The regular expression can be broken down as follows.

(?<= )        # positive lookbehind asserts previous character is a space
(?=           # begin positive lookahead
  .{0,3}[A-Z] # match 0-3 characters followed by a capital letter
)             # end positive lookahead
[A-Z\d]{2,4}  # match 2-4 characters from the character class
\d{4,8}       # match 4-8 digits
(?= )         # positive lookahead asserts next character is a space
|             # or
(?<=\t)       # positive lookbehind asserts previous character is a tab
.*?           # match zero or more characters, as few as possible
(?=\t)        # positive lookahead asserts next character is a tab
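As a rough check of the expression, here is a minimal Python sketch that applies it to a few of the example strings. The helper name `first_circuit_id` is made up for this illustration, and the test lines are assumptions: they are built by hand with tabs placed where `<tab>` appears above. `re.search` returns only the leftmost match, which corresponds to ignoring all matches after the first.

```python
import re

# The suggested expression, unchanged. Python's re module accepts these
# fixed-width lookbehinds as written.
CIRCUIT_ID = re.compile(
    r"(?<= )(?=.{0,3}[A-Z])[A-Z\d]{2,4}\d{4,8}(?= )|(?<=\t).*?(?=\t)"
)


def first_circuit_id(line):
    """Return the first match in the line, or None if there is no match."""
    match = CIRCUIT_ID.search(line)
    return match.group(0) if match else None


# Assumed inputs, with tabs where the examples above show <tab>.
lines = [
    "XYZ INTERNATIONAL" + " " * 6 + "A1B101012" + " " * 8
    + "ABCD-XYZ-ABG089AB\tA000123456\tUNI",
    "XYZ INTERNATIONAL\tAB/PQRS/012345/ /ABC /\tPQRSTUVW\tABCDEFGH",
    "XYZ WORLDWIDE PTE. LTD.\tA1234597\tN/A\tN/A",
    "Convent Inc\tA123 4599\tN/A\tN/A",
]

for line in lines:
    print(repr(first_circuit_id(line)))
```

The first line is matched by the first alternative (the ID sits between spaces); the other three are matched by the second alternative (the text between the first two tabs).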

Upvotes: 2

Onno Rouast

Reputation: 672

You can allow only single spaces by changing the CircuitID part so that each repetition matches either one of the non-space characters [a-zA-Z0-9\-\/] or a space character that isn't followed by another space character (negative lookahead, (?! )).

So the CircuitID part becomes (?<CircuitID>(?:[a-zA-Z0-9\-\/]| (?! )){6,26})
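A minimal Python sketch of this change, under a few assumptions: the company-name prefix is taken from the question's regex, the named group is spelled `(?P<CircuitID>...)` as Python's `re` module requires, the names `PREFIX` and `FIXED` are made up for the example, and the test lines are built by hand.

```python
import re

# Company-name prefix and column separator, copied from the question's regex.
PREFIX = (
    r"((XYZ|xyz|Xyz|Convent){1}(\s{0,})"
    r"(INTERNATIONAL|International|international|INC|Inc|inc|LLC|Llc|llc"
    r"|WORLDWIDE PTE\. LTD\.){0,})"
    r"([\r\n]|[\t]|[ ]{2,})"
)

# Each repetition is either a non-space character from the class, or a
# single space that is not followed by another space.
FIXED = re.compile(PREFIX + r"(?P<CircuitID>(?:[a-zA-Z0-9\-\/]| (?! )){6,26})")

# Assumed inputs: two lines separated by runs of spaces, one by tabs.
samples = [
    "XYZ INTERNATIONAL" + " " * 6 + "A1B101012" + " " * 8 + "ABCD-XYZ-ABG089AB",
    "XYZ INTERNATIONAL" + " " * 3 + "AB/PQRS/012345/ /ABC /" + " " * 2 + "PQRSTUVW",
    "Convent Inc\tA123 4599\tN/A",
]

for line in samples:
    print(repr(FIXED.search(line).group("CircuitID")))
```

The capture now stops at the first run of two or more spaces. A column boundary made of exactly one space would still be absorbed into the ID, so this relies on columns being separated by a tab or by at least two spaces.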

regex101

Upvotes: 3
