Andrea Mario Lufino

Reputation: 7921

Avoiding duplicate items in a comma-separated list of two-letter words

I need to write a regex that allows each group of 2 characters to appear only once. This is my current regex:

^([A-Z]{2},)*([A-Z]{2}){1}$

This lets me validate inputs like these:

AL,RA,IS,GD
AL
AL,RA

The problem is that it also validates AL,AL and AL,RA,AL.
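To make the problem concrete, here is a minimal Swift sketch that runs the pattern above against a few inputs. The helper name `matchesOriginal` is made up for illustration and is not part of the question.

```swift
import Foundation

// The pattern from the question: it checks only the shape of the list.
let originalPattern = #"^([A-Z]{2},)*([A-Z]{2}){1}$"#

// Hypothetical helper: true when the whole string matches the pattern.
func matchesOriginal(_ s: String) -> Bool {
    return s.range(of: originalPattern, options: .regularExpression) != nil
}

// Print each sample next to the result of the check.
for sample in ["AL,RA,IS,GD", "AL", "AL,AL", "AL,RA,AL"] {
    print(sample, matchesOriginal(sample))
}
```

All four samples match, including the two with repeated items, because nothing in the pattern compares one item with another.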

EDIT

Here are more details.

What is allowed:

AL,RA,GD
AL
AL,RA
AL,IS,GD

What shouldn't be allowed:

AL,RA,AL
AL,AL
AL,RA,RA
AL,IS,AL
IS,IS,AL
IS,GD,GD
IS,GD,IS

I need every group of two characters to appear only once in the sequence.

Upvotes: 2

Views: 211

Answers (3)

Cary Swoveland

Reputation: 110685

You can use a negative lookahead with a back-reference:

^(?!.*([A-Z]{2}).*\1).*

if, as in all the examples in the question, it is known that the string contains only comma-separated pairs of capital letters. I will relax that assumption later in my answer.

Demo

The regex performs the following operations:

^             # match beginning of line
(?!           # begin negative lookahead
  .*          # match 0+ characters (1+ OK)
  ([A-Z]{2})  # match 2 uppercase letters in capture group 1
  .*          # match 0+ characters (1+ OK)
  \1          # match the contents of capture group 1
)             # end negative lookahead
.*            # match 0+ characters (the entire string)
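Since the question concerns Swift, here is a minimal sketch of how this first pattern might be applied there. The helper name `hasNoRepeatedPair` is an assumption for illustration; the pattern itself is unchanged.

```swift
import Foundation

// First pattern from this answer. It rejects a string in which some
// two-letter uppercase sequence appears again later on.
// It does not check that the string is a well-formed list.
let noRepeatPattern = #"^(?!.*([A-Z]{2}).*\1).*"#

// Hypothetical helper name, for illustration only.
func hasNoRepeatedPair(_ s: String) -> Bool {
    return s.range(of: noRepeatPattern, options: .regularExpression) != nil
}

for sample in ["AL,RA,GD", "AL", "AL,AL", "AL,RA,AL", "IS,GD,GD"] {
    print(sample, hasNoRepeatedPair(sample))
}
```

The first two samples are accepted and the last three are rejected. Because the shape of the list is not checked here, this sketch relies on the assumption stated above that the input is already known to consist of comma-separated pairs.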

Suppose now that one or more capital letters may appear between each pair of commas, or before the first comma or after the last comma, but it is only strings of two letters that cannot be repeated. Moreover, I assume the regex must confirm that the string has the desired form. Then the following regex could be used:

^(?=[A-Z]+(?:,[A-Z]+)*$)(?!.*(?:^|,)([A-Z]{2}),(?:.*,)?\1(?:,|$)).*

Demo

The regex performs the following operations:

^             # match beginning of line
(?=           # begin pos lookahead
  [A-Z]+      # match 1+ uc letters
  (?:,[A-Z]+) # match ',' followed by 1+ uc letters in a non-cap grp
  *           # execute the non-cap grp 0+ times
  $           # match the end of the line
)             # end pos lookahead
(?!           # begin neg lookahead
  .*          # match 0+ chars
  (?:^|,)     # match beginning of line or ','
  ([A-Z]{2})  # match 2 uc letters in cap grp 1
  ,           # match ','
  (?:.*,)     # match 0+ chars, then ',' in non-cap group
  ?           # optionally match non-cap grp
  \1          # match the contents of cap grp 1
  (?:,|$)     # match ',' or end of line
)             # end neg lookahead
.*            # match 0+ chars (entire string)

If there is no need to check that the string contains only comma-separated strings of one or more uppercase letters, the positive lookahead at the beginning can be removed.
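A minimal Swift sketch of the second pattern follows, with inputs chosen to show that only two-letter items are compared. The helper name `hasUniqueTwoLetterItems` is hypothetical.

```swift
import Foundation

// Second pattern from this answer: the positive lookahead validates the
// shape (comma-separated runs of uppercase letters), and the negative
// lookahead rejects a repeated two-letter item.
let relaxedPattern =
    #"^(?=[A-Z]+(?:,[A-Z]+)*$)(?!.*(?:^|,)([A-Z]{2}),(?:.*,)?\1(?:,|$)).*"#

// Hypothetical helper name, for illustration only.
func hasUniqueTwoLetterItems(_ s: String) -> Bool {
    return s.range(of: relaxedPattern, options: .regularExpression) != nil
}

for sample in ["AL,RA,GD", "AL,RA,AL", "ABC,ABC", "AL,ALX,RA", "AL,ra", "AL,"] {
    print(sample, hasUniqueTwoLetterItems(sample))
}
```

Here "ABC,ABC" and "AL,ALX,RA" are accepted because no two-letter item is repeated, "AL,RA,AL" is rejected for the repeated AL, and "AL,ra" and "AL," are rejected by the shape check.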

Upvotes: 1

Wiktor Stribiżew

Reputation: 626920

First of all, let's shorten your pattern. It can be easily achieved since the length of each comma-separated item is fixed and the list items are only made up of uppercase ASCII letters. So, your pattern can be written as ^(?:[A-Z]{2}(?:,\b)?)+$. See this regex demo.
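As a quick check, the sketch below compares the question's pattern with the shortened one on a few inputs. The helper `matches` and the third pattern, `strictPattern`, are assumptions added for illustration and are not part of this answer.

```swift
import Foundation

let originalPattern = #"^([A-Z]{2},)*([A-Z]{2}){1}$"#   // from the question
let shortenedPattern = #"^(?:[A-Z]{2}(?:,\b)?)+$"#      // from this answer
// Assumption: an alternative shortening that makes the comma mandatory
// between items.
let strictPattern = #"^[A-Z]{2}(?:,[A-Z]{2})*$"#

// Hypothetical helper: true when the string matches the given pattern.
func matches(_ s: String, _ pattern: String) -> Bool {
    return s.range(of: pattern, options: .regularExpression) != nil
}

for sample in ["AL", "AL,RA", "AL,", "ALRA"] {
    print(sample,
          matches(sample, originalPattern),
          matches(sample, shortenedPattern),
          matches(sample, strictPattern))
}
```

The three patterns agree on "AL", "AL,RA" and "AL,". They differ on "ALRA": because the comma is optional in the shortened pattern, it accepts that input, while the other two reject it. If comma-less runs can occur in the input, the stricter form may be the safer choice.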

Now, you need to add a negative lookahead that checks the string for any repeated two-letter sequence, at any distance from the start of the string and with any distance between the two occurrences. Use

^(?!.*\b([A-Z]{2})\b.*\b\1\b)(?:[A-Z]{2}(?:,\b)?)+$

See the regex demo

Possible implementation in Swift:

import Foundation

// Returns true when the whole input matches the pattern above.
func isValidInput(_ input: String) -> Bool {
    return input.range(of: #"^(?!.*\b([A-Z]{2})\b.*\b\1\b)(?:[A-Z]{2}(?:,\b)?)+$"#, options: .regularExpression) != nil
}

print(isValidInput("AL,RA,GD")) // true
print(isValidInput("AL,RA,AL")) // false

Details

  • ^ - start of string
  • (?!.*\b([A-Z]{2})\b.*\b\1\b) - a negative lookahead that fails the match if, immediately to the right of the current location, there is:
    • .* - any 0+ chars other than line break chars, as many as possible
    • \b([A-Z]{2})\b - a two-letter word as a whole word
    • .* - any 0+ chars other than line break chars, as many as possible
    • \b\1\b - the same whole word as in Group 1. NOTE: The word boundaries here are not necessary in the current scenario, where the word length is fixed at two. If you do not know the word length and use [A-Z]+ instead, you will need the word boundaries, or other boundaries depending on the situation
  • (?:[A-Z]{2}(?:,\b)?)+ - 1 or more sequences of:
    • [A-Z]{2} - two uppercase ASCII letters
    • (?:,\b)? - an optional sequence: a , only if it is followed by a word char (letter, digit or _). This guarantees that , won't be allowed at the end of the string
  • $ - end of string.

Upvotes: 3

oriberu

Reputation: 1216

Try something like this expression:

/^(?:,?(\b\w{2}\b)(?!.*\1))+$/gm

I have no knowledge of Swift, so take it with a grain of salt. The idea is basically to match only a whole line while making sure that no matched group occurs again at a later point in the line.
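For readers working in Swift, here is a minimal sketch of how the expression might be run with `NSRegularExpression`. The `/…/gm` delimiters and flags are not Swift syntax, so the sketch assumes `.anchorsMatchLines` as the counterpart of the `m` flag and collects every match in place of `g`. The helper name `linesWithoutRepeats` is hypothetical.

```swift
import Foundation

// The expression from this answer, without the /…/gm delimiters and flags.
let linePattern = #"^(?:,?(\b\w{2}\b)(?!.*\1))+$"#

// Hypothetical helper: returns the lines of `text` that match the pattern.
func linesWithoutRepeats(in text: String) -> [String] {
    // .anchorsMatchLines lets ^ and $ match at line breaks (like the m flag).
    guard let regex = try? NSRegularExpression(pattern: linePattern,
                                               options: [.anchorsMatchLines]) else {
        return []
    }
    let fullRange = NSRange(text.startIndex..., in: text)
    // Collect every match in the text (like the g flag).
    return regex.matches(in: text, options: [], range: fullRange).compactMap { match in
        Range(match.range, in: text).map { String(text[$0]) }
    }
}

let sample = "AL,RA,GD\nAL,RA,AL\nIS,GD\nIS,IS,AL"
print(linesWithoutRepeats(in: sample))
```

For this sample the helper returns the first and third lines only. Note that `\w` also accepts lowercase letters, digits and underscores, and that the optional leading comma means a line such as ",AL" matches as well, so the expression is looser than the format described in the question.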

Upvotes: 3
