Curtis White

Reputation: 250

Forcing a Strict Character Order in a Regex Expression

I'm trying to create a regex in JavaScript that restricts the order in which certain characters can appear, but I'm having trouble getting the validation fully correct.

The criteria for the expression are a little complicated. The user must input strings that meet the following criteria:

  1. The string contains two parts, an initial group, and an end group.
  2. The groups are separated by a colon (:).
  3. Strings are separated by a semi-colon (;).
  4. The initial group can start with one optional forward-slash and end with one optional forward-slash, but these forward-slashes may not appear anywhere else in the group.
  5. Inside forward-slashes, one optional underscore may appear on either end, but they may not appear anywhere else in the group.
  6. Inside these optional elements, the user may enter any number of numbers or letters, uppercase or lowercase, but exactly one of these characters must be surrounded with angular brackets (<>).
  7. If the letter inside the brackets is an uppercase C, it may be followed by one of a lowercase u or v.
  8. The end group may contain one or more of a number or letter, uppercase or lowercase (If it is an uppercase C, it can be followed by a lowercase u or v.) or one asterisk (*), but not both.
  9. A string must be able to validate with multiple groupings.

This probably sounds a little confusing.

For example, the following examples are valid:

<C>:Cu;
<Cu>:Cv;
/_V<C>V:C;
/_VV<Cv>VV_/:Cu;
_<V>:V1;
_<V>_:V1;
_<V>/:V1;
_<V>:*;
_<m>:n;

The following are invalid:

Cu:Cv;
Cu:Cv 
CuCv;
<Cu/>:Cv; 
<Cu_>:Cv; 
<Cu>:Cv/;
_/<Cu>:Cv;
<Cu>/_:Cv;

They should validate when grouped together like so.

<Cu>:Cv;/_V<C>V:C;_<V>:V1;_<V>/:V1;_<V>:*;_<m>:n;

Hopefully, these examples help you understand what I'm trying to match.

I created the following regexp and tested it on Regex101.com, but this is the closest I could come:

(\\/{0,1}_{0,1}[A-Za-z0-9]{0,}<{1}[A-Za-z0-9]{1,2}>{1}[A-Za-z0-9]{0,}_{0,1}\\/{0,1}):([A-Za-z0-9]{1,2}|\\*);$

It's mostly correct, but it allows strings that should be invalid such as:

_/<C>:C;

If an underscore comes before the first forward-slash, it should be rejected. Otherwise, my regexp seems to be correct for all other cases.
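A possible explanation, sketched here under assumptions: the pattern has no ^ anchor, so the engine is free to start the match at the slash and skip the leading underscore. The snippet writes the pattern as a regex literal with its parentheses balanced (an assumption about the intended grouping), and the names unanchored and anchored are made up for illustration.

```javascript
// The pattern from the question as a regex literal.
// Assumption: the parentheses are balanced as shown here.
const unanchored = /(\/{0,1}_{0,1}[A-Za-z0-9]{0,}<{1}[A-Za-z0-9]{1,2}>{1}[A-Za-z0-9]{0,}_{0,1}\/{0,1}):([A-Za-z0-9]{1,2}|\*);$/;

// Same pattern with a leading ^ so the match must begin at index 0.
const anchored = new RegExp("^" + unanchored.source);

const hit = unanchored.exec("_/<C>:C;");
console.log(hit.index, hit[0]);             // 1 "/<C>:C;" (the leading "_" is skipped)
console.log(anchored.test("_/<C>:C;"));     // false
console.log(anchored.test("/_V<C>V:C;"));   // true
```

Anchoring this way only covers a single grouping; a string with several groupings needs a repeated group inside the anchors.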

If anyone has any suggestions on how to fix this, or knows of a way to match all criteria much more efficiently, any help is appreciated.

Upvotes: 1

Views: 1461

Answers (2)

Pinke Helga

Reputation: 6682

Did you mean this?

/^(?:(^|\s*;\s*)(?:\/_|_)?[a-z]*<[a-z]+>[a-z]*_?\/?:(?:[a-z0-9]+|\*)(?=;))+;$/i

We start with a case-insensitive expression /.../i to keep it more readable. You have to rewrite it to a case-sensitive expression if you only want to allow uppercase at the beginning of a word.

^ means the beginning of the string. $ means the end of the string.

The whole string ends with ';' after multiple repetitions of the inner expression (?:...)+, where + means one or more occurrences. The ;$ at the end includes the last semicolon in the match. It is not needed just to require that semicolon, since the look-ahead already does that, but without the $ the match is no longer anchored to the end of the string.

(^|\s*;\s*) Every part is at the beginning of the string or after a semicolon surrounded by arbitrary whitespace, including line feeds. Use \n if you do not want to allow spaces and tabs.

(?:...|...) is a non-captured alternative. ? after a character or group is the quantifier 0/1 - none or once.

So (?:\/_|_)? means '/_', '_' or nothing. Use \/?_? if you also want to allow strings starting with a single slash.

[a-z]*<[a-z]+>[a-z]* 0 or more letters followed by <...> with at least one letter inside and again followed by 0 or more letters.

_?\/?: optional '_', optional '/', mandatory : in this sequence.

(?:[a-z0-9]+|\*) The part after the colon contains letters and numbers or the asterisk.

(?=;) Look-ahead: Every group must be followed by a semicolon. Look-ahead conditions do not move the search position.
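As a rough check, the expression above can be run against the examples from the question. This is only a sketch: the array names and the helper acceptsHelga are made up for illustration, and the pattern is copied unchanged from this answer.

```javascript
// Pattern copied from this answer (case-insensitive, no g flag, so test() is stateless).
const helgaPattern = /^(?:(^|\s*;\s*)(?:\/_|_)?[a-z]*<[a-z]+>[a-z]*_?\/?:(?:[a-z0-9]+|\*)(?=;))+;$/i;

// Hypothetical helper name, used only in this sketch.
const acceptsHelga = (input) => helgaPattern.test(input);

const shouldPass = [
  "<C>:Cu;", "<Cu>:Cv;", "/_V<C>V:C;", "/_VV<Cv>VV_/:Cu;",
  "_<V>:V1;", "_<V>_:V1;", "_<V>/:V1;", "_<V>:*;", "_<m>:n;",
  "<Cu>:Cv;/_V<C>V:C;_<V>:V1;_<V>/:V1;_<V>:*;_<m>:n;",
];
const shouldFail = [
  "Cu:Cv;", "Cu:Cv", "CuCv;", "<Cu/>:Cv;", "<Cu_>:Cv;",
  "<Cu>:Cv/;", "_/<Cu>:Cv;", "<Cu>/_:Cv;",
];

for (const s of shouldPass) console.log("expect true: ", s, acceptsHelga(s));
for (const s of shouldFail) console.log("expect false:", s, acceptsHelga(s));

// As noted in the answer, a lone leading slash is rejected unless (?:\/_|_)? is replaced by \/?_?.
console.log(acceptsHelga("/<C>:C;")); // false
```

Note that this pattern accepts more than one letter inside the brackets and no digits in the initial group, so it is looser or stricter than the stated criteria in those two spots.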

Upvotes: 1

41686d6564

Reputation: 19641

The following seems to fulfill all the criteria:

(?:^|;)(\/?_?[a-zA-Z0-9]*<(?:[a-zA-Z]|C[uv]?)>[a-zA-Z0-9]*_?\/?):([a-zA-Z0-9]+|\*)(?=;|$)

Regex101 demo.

It puts each of the "groups" in a capturing group so you can access them individually.

Details:

  • (?:^|;) A non-capturing group to make sure the match starts either at the beginning of the string or right after a semicolon.

  • ( Start of group 1.

    • \/?_? An optional forward-slash followed by an optional underscore.

    • [a-zA-Z0-9]* Any letter or number - Matches zero or more.

    • <(?:[a-zA-Z]|C[uv]?)> Mandatory <> pair containing one letter or the capital letter C followed by a lowercase u or v.

    • [a-zA-Z0-9]* Any letter or number - Matches zero or more.

    • _?\/? An optional underscore followed by an optional forward-slash.

  • ) End of group 1.

  • : Matches a colon character literally.

  • ([a-zA-Z0-9]+|\*) Group 2 - containing one or more numbers or letters or a single * character.

  • (?=;|$) A positive lookahead to make sure the match is either followed by a semicolon or ends at the end of the string.
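A minimal sketch of how the two capturing groups might be read in JavaScript, assuming the g flag is added so that matchAll can walk through every grouping. The helper names extractGroups and isWholeListValid are hypothetical. The second pattern is also an assumption and not part of the answer: it reuses the same pieces but anchors them so the entire input has to consist of groupings, because the pattern above finds groupings anywhere in the input rather than validating all of it.

```javascript
// Pattern from this answer, with the g flag added for matchAll.
const groupPattern = /(?:^|;)(\/?_?[a-zA-Z0-9]*<(?:[a-zA-Z]|C[uv]?)>[a-zA-Z0-9]*_?\/?):([a-zA-Z0-9]+|\*)(?=;|$)/g;

// Hypothetical helper: returns [initialGroup, endGroup] pairs for every match.
function extractGroups(input) {
  return Array.from(input.matchAll(groupPattern), (m) => [m[1], m[2]]);
}

// Assumption, not from the answer: an anchored variant in which every
// grouping must be terminated by a semicolon and nothing else may appear.
const wholePattern = /^(?:\/?_?[a-zA-Z0-9]*<(?:[a-zA-Z]|C[uv]?)>[a-zA-Z0-9]*_?\/?:(?:[a-zA-Z0-9]+|\*);)+$/;

function isWholeListValid(input) {
  return wholePattern.test(input);
}

console.log(extractGroups("<Cu>:Cv;/_V<C>V:C;_<V>:*;"));
// [ [ '<Cu>', 'Cv' ], [ '/_V<C>V', 'C' ], [ '_<V>', '*' ] ]

console.log(extractGroups("abc;<C>:C;"));     // [ [ '<C>', 'C' ] ] (the stray "abc" is skipped)
console.log(isWholeListValid("abc;<C>:C;"));  // false
console.log(isWholeListValid("_/<C>:C;"));    // false
console.log(isWholeListValid("<Cu>:Cv;/_V<C>V:C;_<V>:V1;_<V>/:V1;_<V>:*;_<m>:n;")); // true
```

Extraction and validation are kept separate here: the first pattern is convenient for pulling out the parts, while the anchored variant answers the yes/no question for the full input.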

Upvotes: 2
