Reputation: 961
I am stuck on a regular expression and have been unable to fix it after trying several different approaches.
Here is the link to the RegEx with sample input; the RegEx and sample text are also reproduced below for quick reference:
Regex:
((XYZ|xyz|Xyz|Convent){1}(\s{0,})(INTERNATIONAL|International|international|INC|Inc|inc|LLC|Llc|llc|WORLDWIDE PTE\. LTD\.){0,})([\r\n]|[\t]|[ ]{2,})(?<CircuitID>[a-zA-Z0-9\-\/\ ]{6,26})\s*
Sample input:
Circuits Affected:
Company Name Circuit ID Z End Billing ID ServiceType
XYZ INTERNATIONAL A1B101012 ABCD-XYZ-ABG089AB A000123456 UNI
XYZ INTERNATIONAL AB/PQRS/012345/ /ABC / PQRSTUVW ABCDEFGH CO ADDRESS CITY 1234 S RIDGELINE BLVD VA SOMECITY 12345 SOME BRANCH PKWY A0123456 N/A 0:00 CDT - 1:00 CDT
________________________________________
XYZ INTERNATIONAL AB/ABYX/271703/ /ABC / ABCDO47HBS ABCDEFG71 N/A N/A A0123456 N/A 0:00 CDT - 1:00 CDT
________________________________________
XYZ WORLDWIDE PTE. LTD. A1234597 N/A N/A N/A N/A A0123456 abcde-randomashb-12234568 0:00 CDT - 1:00 CDT
________________________________________
Convent Inc A123 4599 N/A N/A N/A N/A A0123456 abcde-randomhigh-00012345 0:00 CDT - 1:00 CDT
________________________________________
XYZ INTERNATIONAL B124565 N/A N/A N/A N/A A0123456 abcde-randomashb-12234568 0:00 CDT - 1:00 CDT
________________________________________
XYZ INC AB/CDEF/123455/ /ABC / ABCDEFGH PQRSTUVW N/A N/A A0123456 N/A 0:00 CDT - 1:00 CDT
________________________________________
What I am looking for is a way to capture the values from the Circuit ID column. The regex works well when there is a tab character after the circuit ID. Sometimes, however, the circuit ID is followed by multiple spaces instead of a tab, and then the match is wrong because it picks up extra characters from the next column.
To get around this, I need to modify the capture group so that it accepts a space inside the Circuit ID only when it is a single space.
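To make the failure easy to reproduce, here is a minimal Python sketch of the behaviour described above. It is only an illustration under assumptions: Python needs the named group written as (?P<CircuitID>...) rather than (?<CircuitID>...), the helper name circuit_id is made up for this example, and the tab and space positions in the two test strings are my guess at the real input, since the whitespace was lost in the pasted sample.

```python
import re

# The pattern from the question, with the named group rewritten in
# Python syntax: (?P<CircuitID>...) instead of (?<CircuitID>...).
ORIGINAL = re.compile(
    r"((XYZ|xyz|Xyz|Convent){1}(\s{0,})"
    r"(INTERNATIONAL|International|international|INC|Inc|inc|LLC|Llc|llc|WORLDWIDE PTE\. LTD\.){0,})"
    r"([\r\n]|[\t]|[ ]{2,})"
    r"(?P<CircuitID>[a-zA-Z0-9\-\/\ ]{6,26})\s*"
)


def circuit_id(line):
    """Return the CircuitID group of the first match, or None (hypothetical helper)."""
    match = ORIGINAL.search(line)
    return match.group("CircuitID") if match else None


# Assumed layouts: a tab after the circuit ID versus three spaces after it.
tab_after = "XYZ INTERNATIONAL\tA1B101012\tABCD-XYZ-ABG089AB\tA000123456\tUNI"
spaces_after = "XYZ INTERNATIONAL\tA1B101012   ABCD-XYZ-ABG089AB\tA000123456\tUNI"

print(repr(circuit_id(tab_after)))     # stops at the tab
print(repr(circuit_id(spaces_after)))  # runs on into the next column
```

The second call over-captures because the character class allows any number of consecutive spaces, so the group keeps consuming text until it hits a tab or the 26-character limit.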
In the screenshot below, the data underlined in red should not be captured. The values underlined in green are OK:
Upvotes: 5
Views: 1776
Reputation: 110675
The main task is to establish the rules for identifying the Circuit ID. Once that is done it can be determined if those strings can be extracted with the use of a regular expression.
After positing rules for identifying the Circuit ID, my objective was to construct a regular expression that does not need to enumerate company names, since attempts to match those names could be problematic due to misspellings or other variations.
Let's first look at the beginnings of the example strings that contain what I understand to be the circuit ids, which I've indicated by ^^^^^.
XYZ INTERNATIONAL A1B101012 AB ...<tab>...
^^^^^^^^^
XYZ INTERNATIONAL<tab>AB/PQRS/012345/ /ABC /<tab>PQR...
^^^^^^^^^^^^^^^^^^^^^^
XYZ INTERNATIONAL<tab>AB/ABYX/271703/ /ABC /<tab>ABC...
^^^^^^^^^^^^^^^^^^^^^^
XYZ WORLDWIDE PTE. LTD.<tab>A1234597<tab>N/A...
^^^^^^^^
Convent Inc<tab>A123 4599<tab>N/A...
^^^^^^^^
XYZ INTERNATIONAL<tab>B124565<tab>N/A...
^^^^^^^
XYZ INC<tab>WA/OGGS/186815/ /ABC /<tab>ABC...
^^^^^^^^^^^^^^^^^^^^^^
The triple dots are placeholders for additional text.
On the basis of this limited sample I will posit a rule for identifying the circuit id string. If my assumptions are incorrect the OP needs to clarify the rule.
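I assume that a circuit id string is either (1) a string that is preceded and followed by a space and consists of 2-4 capital letters or digits followed by 4-8 digits, with a capital letter among its first four characters, or (2) a string that is preceded and followed by a tab.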
A string may have multiple substrings that satisfy one of these two requirements, but it appears from the data that, if there is at least one match, the first match will return the circuit id string. Therefore, all matches after the first are to be ignored.
The rule I have set out may of course have to be modified by the OP, which would necessitate a change to the regular expression I suggest using, which is as follows.
(?<= )(?=.{0,3}[A-Z])[A-Z\d]{2,4}\d{4,8}(?= )|(?<=\t).*?(?=\t)
The regular expression can be broken down as follows.
(?<= ) # positive lookbehind asserts previous character is a space
(?= # begin positive lookahead
.{0,3}[A-Z] # match 0-3 characters followed by a capital letter
) # end positive lookahead
[A-Z\d]{2,4} # match 2-4 characters from the character class
\d{4,8} # match 4-8 digits
(?= ) # positive lookahead asserts next character is a space
| # or
(?<=\t) # positive lookbehind asserts previous character is a tab
.*? # match zero or more characters, as few as possible (lazy)
(?=\t) # positive lookahead asserts next character is a tab
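As a quick check, here is a minimal Python sketch that applies the regular expression above to a few of the example lines and keeps only the first match on each line, as the rule requires. The tab positions in the test strings are an assumption based on the <tab> markers shown earlier, the text after the circuit id is filled in from the question's sample, and first_match is just a hypothetical helper name.

```python
import re

# The suggested pattern, split across two string literals for readability.
CIRCUIT_ID = re.compile(
    r"(?<= )(?=.{0,3}[A-Z])[A-Z\d]{2,4}\d{4,8}(?= )"  # space-delimited form
    r"|(?<=\t).*?(?=\t)"                               # tab-delimited form
)


def first_match(line):
    """Return the first match in the line, or None (hypothetical helper)."""
    match = CIRCUIT_ID.search(line)
    return match.group(0) if match else None


# Assumed layouts; "\t" marks where the examples show <tab>.
lines = [
    "XYZ INTERNATIONAL A1B101012 ABCD-XYZ-ABG089AB\tA000123456\tUNI",
    "XYZ INTERNATIONAL\tAB/PQRS/012345/ /ABC /\tPQRSTUVW ABCDEFGH CO",
    "XYZ WORLDWIDE PTE. LTD.\tA1234597\tN/A",
    "Convent Inc\tA123 4599\tN/A",
]

for line in lines:
    print(repr(first_match(line)))
```

re.search stops at the leftmost match, which is what implements "all matches after the first are to be ignored". The lines have to be processed one at a time, because . does not cross newlines by default.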
Upvotes: 2
Reputation: 672
You can match single spaces by editing your CircuitID part to match either a space character that isn't followed by another space character, using the negative lookahead (?! ), or one of the non-space characters [a-zA-Z0-9\-\/].
So the CircuitID part becomes (?<CircuitID>(?:[a-zA-Z0-9\-\/]| (?! )){6,26})
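Here is a minimal Python sketch of that change dropped into the full pattern from the question, to show that it stops at a run of two or more spaces. Treat it as an illustration under assumptions: Python spells the named group (?P<CircuitID>...), the helper name circuit_id is made up, and the whitespace in the test strings is a guess at the real input.

```python
import re

# The question's pattern with only the CircuitID group changed: a space is
# accepted only when the next character is not another space.
FIXED = re.compile(
    r"((XYZ|xyz|Xyz|Convent){1}(\s{0,})"
    r"(INTERNATIONAL|International|international|INC|Inc|inc|LLC|Llc|llc|WORLDWIDE PTE\. LTD\.){0,})"
    r"([\r\n]|[\t]|[ ]{2,})"
    r"(?P<CircuitID>(?:[a-zA-Z0-9\-\/]| (?! )){6,26})\s*"
)


def circuit_id(line):
    """Return the CircuitID group of the first match, or None (hypothetical helper)."""
    match = FIXED.search(line)
    return match.group("CircuitID") if match else None


# Assumed layouts: the circuit ID is followed by several spaces, not a tab.
samples = [
    "XYZ INTERNATIONAL\tA1B101012   ABCD-XYZ-ABG089AB\tA000123456\tUNI",
    "XYZ INTERNATIONAL\tAB/PQRS/012345/ /ABC /  PQRSTUVW ABCDEFGH CO",
    "Convent Inc\tA123 4599  N/A\tN/A",
]

for line in samples:
    print(repr(circuit_id(line)))
```

Single spaces inside an ID such as "A123 4599" are still captured. Note that this relies on the next column being separated by a tab or by at least two spaces; a single space before the next column would still be treated as part of the ID.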
Upvotes: 3