sdklly

Reputation: 19

Regular expression that matches two duplicate consecutive characters within a string but not three or more. It should still match if both 'aa' and 'bbb' exist

My original question was closed for being a duplicate. I disagree with it being a duplicate as this is a different use case looking at regular expression syntax. I have tried to clarify my question below.

Is it possible to create a regular expression which matches two duplicate consecutive characters within a string (in this example, lowercase letters) but does not match a section of the string if the same character is on either side, e.g. match 'aa' but not 'aaa' or 'aaaa'?

Additionally:

Examples:

ffumlmqwfcsyqpss should match either 'ff' or 'ss'.

zztdcqzqddaazdjp should match either 'zz','dd', 'aa'.

urrvucyrzzzooxhx should match 'rr' or 'oo' even though 'zzz' exists in the string.

zettygjpcoedwyio should match 'tt'.

dtfkgggvqadhqbwb should not match 'ggg'.

rwgwbwzebsnjmtln should not match.

What I had originally tried

([a-z])\1 to capture the duplicate character but this also matches when there are additional duplicate characters such as 'aaa' or 'aaaa' etc.

([a-z])\1(?!\1) to negate the third duplicate character but this just moves the match to the end of the duplicate character string.

Negative lookarounds to compensate for a match at the beginning but I think I am causing some kind of loop which will never match.

>>> import re
>>> re.search(r'([a-z])\1(?!\1)', 'dtfkgggvqadhqbwb')
<re.Match object; span=(5, 7), match='gg'>  # should not match, as this 'gg' is part of 'ggg' ('[gg]g' or 'g[gg]')
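
To show the two attempts side by side, here is a minimal sketch run against the string that should not match. The variable names are only illustrative and are not part of the original attempts.

```python
import re

# Attempt 1: any doubled lowercase letter.
first_attempt = re.compile(r'([a-z])\1')
# Attempt 2: a doubled letter that is not followed by the same letter again.
second_attempt = re.compile(r'([a-z])\1(?!\1)')

sample = 'dtfkgggvqadhqbwb'  # 'ggg' sits at indexes 4-6; there is no isolated pair

m1 = first_attempt.search(sample)
m2 = second_attempt.search(sample)

print(m1.span(), m1.group())  # (4, 6) gg  -> first two letters of 'ggg'
print(m2.span(), m2.group())  # (5, 7) gg  -> last two letters of 'ggg'
```

In both cases the match is carved out of the longer run. The lookahead only shifts it one position to the right.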

Currently offered solutions don't match described criteria.

Summary

So far the only answer which works is Wiktor Stribiżew's, but it relies on the (*SKIP) verb of the third-party 'regex' module. Is a solution not possible using 'standard' regular expression syntax?

Upvotes: 1

Views: 2032

Answers (2)

Wiktor Stribiżew

Reputation: 626903

In Python re, the main difficulty with writing the right regex for this task is that a capturing group must be defined before a backreference to it can be used, whereas a negative lookbehind is usually placed before the captured pattern. Also, the Python testing option on regex101.com does not always reflect the current state of the re library. It confuses users with a message like "This token can not be used in a lookbehind due to either making it non-fixed width or interfering with the pattern matching" when it sees a \1 in (?<!\1), although Python has allowed this since v3.5 for groups of fixed length.

The pattern you can use here is

(.)(?<!\1.)\1(?!\1)


Details

  • (.) - Capturing group 1: any single char (if re.DOTALL is used, even line break chars)
  • (?<!\1.) - a negative lookbehind that fails the match if, immediately to the left of the current location, there is the same char as captured in Group 1 followed by any single char. That single char is the Group 1 capture itself, so \1 can be used instead of the . and it will work the same
  • \1 - same char as in Group 1
  • (?!\1) - a negative lookahead that fails the match if there is the same char as in Group 1 immediately to the right of the current location.
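
The bullets above can be checked with a short sketch. The helper name exact_pairs and the constant names are hypothetical and chosen only for illustration. The second pattern is the variant with \1 in place of the . mentioned in the lookbehind bullet. The sketch assumes Python 3.5 or later, where references to fixed-width groups are accepted inside a lookbehind.

```python
import re

# Pattern from the answer, plus the variant with \1 in place of '.'.
PAIR_DOT = re.compile(r'(.)(?<!\1.)\1(?!\1)', re.DOTALL)
PAIR_REF = re.compile(r'(.)(?<!\1\1)\1(?!\1)', re.DOTALL)

def exact_pairs(pattern, text):
    """Return every run of exactly two identical chars that `pattern` finds."""
    return [m.group() for m in pattern.finditer(text)]

for text in ('aa', 'aaa', 'aaaa', 'baab', 'urrvucyrzzzooxhx'):
    # Both variants are expected to agree on every input.
    print(text, exact_pairs(PAIR_DOT, text), exact_pairs(PAIR_REF, text))
```

At the first char of a longer run the lookahead rejects the match. At every later char of that run the lookbehind rejects it, because the preceding char equals Group 1.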

See the Python test:

import re

# Each test string maps to the pairs it is expected to yield.
tests = {
    'ffumlmqwfcsyqpss': ['ff', 'ss'],
    'zztdcqzqddaazdjp': ['zz', 'dd', 'aa'],
    'urrvucyrzzzooxhx': ['rr', 'oo'],
    'zettygjpcoedwyio': ['tt'],
    'dtfkgggvqadhqbwb': [],
    'rwgwbwzebsnjmtln': [],
}

for test, answer in tests.items():
    # Collect every run of exactly two identical chars.
    matches = [m.group() for m in re.finditer(r'(.)(?<!\1.)\1(?!\1)', test, re.DOTALL)]
    if matches:
        print(f"Matches found in '{test}': {matches}. Is the answer expected? {set(matches)==set(answer)}.")
    else:
        print(f"No match found in '{test}'. Is the answer expected? {set(matches)==set(answer)}.")

Output:

Matches found in 'ffumlmqwfcsyqpss': ['ff', 'ss']. Is the answer expected? True.
Matches found in 'zztdcqzqddaazdjp': ['zz', 'dd', 'aa']. Is the answer expected? True.
Matches found in 'urrvucyrzzzooxhx': ['rr', 'oo']. Is the answer expected? True.
Matches found in 'zettygjpcoedwyio': ['tt']. Is the answer expected? True.
No match found in 'dtfkgggvqadhqbwb'. Is the answer expected? True.
No match found in 'rwgwbwzebsnjmtln'. Is the answer expected? True.

Upvotes: 1

Tim Biegeleisen

Reputation: 521419

You may use the following regex pattern:

^(?![a-z]*([a-z])\1{2,})[a-z]*([a-z])\2[a-z]*$


This pattern says to match:

^                        start of the string
(?![a-z]*([a-z])\1{2,})  assert that no letter occurs 3 or more times in a row anywhere in the string
[a-z]*                   zero or more letters
([a-z])                  capture a letter
\2                       which is followed by the same letter
[a-z]*                   zero or more letters
$                        end of the string
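
As a quick check that is not part of the original answer, the sketch below runs this pattern against the strings from the question. The names WHOLE, samples and results are illustrative only. Because the pattern is anchored with ^ and $, it accepts or rejects the whole string, so the pair is read from Group 2 rather than from the match itself.

```python
import re

# Pattern from this answer; it is anchored, so it matches the whole string or nothing.
WHOLE = re.compile(r'^(?![a-z]*([a-z])\1{2,})[a-z]*([a-z])\2[a-z]*$')

samples = [
    'ffumlmqwfcsyqpss',
    'zztdcqzqddaazdjp',
    'urrvucyrzzzooxhx',
    'zettygjpcoedwyio',
    'dtfkgggvqadhqbwb',
    'rwgwbwzebsnjmtln',
]

results = {}
for s in samples:
    m = WHOLE.search(s)
    # Group 2 holds one letter of the pair the greedy [a-z]* settled on (the last pair).
    results[s] = m.group(2) * 2 if m else None
    print(s, results[s])
```

A string containing a triple anywhere is rejected as a whole, so 'urrvucyrzzzooxhx' yields no match here, whereas the question expects 'rr' or 'oo' for it.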

Upvotes: 0
