Reputation: 19

Regular Expression which matches two duplicate consecutive characters within string but not three or more. Should match if both 'aa' and 'bbb' exist

My original question was closed for being a duplicate. I disagree with it being a duplicate as this is a different use case looking at regular expression syntax. I have tried to clarify my question below.

Is it possible to create a regular expression which matches two duplicate consecutive characters within a string (in this example, lowercase letters) but does not match that section of the string if the same character also appears immediately on either side? E.g. match 'aa' but not 'aaa' or 'aaaa'.

Additionally:

Although I am using Python 3.10 I am trying to work out if this is possible using 'standard' regular expression syntax without utilising additional functionality provided by external modules. For example using Python this would mean a solution which uses the 're' module from the standard library.
If there are 3 or more duplicate consecutive characters, the string should still match if there are two duplicate consecutive characters elsewhere in the string, e.g. match 'aa' even if 'bbb' exists elsewhere in the string.
The string should also match if the two duplicate consecutive characters appear at the beginning or end of the string.
My examples are 16 character strings if a specific length makes a difference.

Examples:

ffumlmqwfcsyqpss should match either 'ff' or 'ss'.

zztdcqzqddaazdjp should match either 'zz','dd', 'aa'.

urrvucyrzzzooxhx should match 'rr' or 'oo' even though 'zzz' exists in the string.

zettygjpcoedwyio should match 'tt'.

dtfkgggvqadhqbwb should not match 'ggg'.

rwgwbwzebsnjmtln should not match.

What I had originally tried

([a-z])\1 to capture the duplicate character but this also matches when there are additional duplicate characters such as 'aaa' or 'aaaa' etc.

([a-z])\1(?!\1) to negate the third duplicate character but this just moves the match to the end of the duplicate character string.

Negative lookarounds to compensate for a match at the beginning but I think I am causing some kind of loop which will never match.

>>> import re

>>> re.search(r'([a-z])\1(?!\1)', 'dtfkgggvqadhqbwb')
<re.Match object; span=(5, 7), match='gg'> # should not match at all, neither as '[gg]g' nor as 'g[gg]'

The currently offered solutions don't meet the described criteria.

Wiktor Stribiżew's solution uses the additional (*SKIP) functionality of the external python regex module.
Tim Biegeleisen's solution does not match duplicate pairs if there are duplicate triples etc in the same string.
In the linked question, Cary Swoveland's solutions do not work for duplicate pairs at the beginning or end of a string or match even when there is no duplicate in the string.
In the linked question, the fourth bird's solution does not match duplicate pairs at the beginning or end of strings.

Summary

So far the only answer which works is Wiktor Stribiżew's but this uses the (*SKIP) function of the external 'regex' module. Is a solution not possible using 'standard' regular expression syntax?

Upvotes: 1

Answers (2)

Wiktor Stribiżew

Reputation: 626903

In Python re, the main problem with creating the right regex for this task is that you need to define the capturing group before using a backreference to it, while negative lookbehinds are usually placed before the captured pattern. Also, the Python testing option on regex101.com does not always reflect the current state of the re library, and it confuses users with a message like "This token can not be used in a lookbehind due to either making it non-fixed width or interfering with the pattern matching" when it sees a \1 in (?<!\1), whereas Python has allowed this since v3.5 for groups of fixed length.
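If you want to check the fixed-length rule on your own interpreter, here is a minimal sketch. The two patterns and the compiles helper are made up for illustration and are not part of the solution; the only difference between them is whether Group 1 has a fixed width.

```python
import re

# Illustrative patterns only: both place a backreference inside a lookbehind,
# but they differ in whether group 1 has a fixed width.
FIXED_WIDTH = r'(.)(?<!\1.)x'      # group 1 is always exactly one char
VARIABLE_WIDTH = r'(.+)(?<!\1.)x'  # group 1 can be one or more chars


def compiles(pattern):
    """Return True if re accepts the pattern, False if it raises re.error."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


fixed_ok = compiles(FIXED_WIDTH)
variable_ok = compiles(VARIABLE_WIDTH)
print(fixed_ok, variable_ok)
```

On Python 3.5 or later I would expect the first pattern to compile and the second to be rejected, since a lookbehind needs to know how many characters to step back.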

The pattern you can use here is

(.)(?<!\1.)\1(?!\1)

See the regex demo.

Details

(.) - Capturing group 1: any single char (if re.DOTALL is used, even line break chars)
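(?<!\1.) - a negative lookbehind that fails the match if, immediately to the left of the current location, there is the same char as captured in Group 1 followed by any single char. Since the char immediately to the left is the one just captured, this effectively checks that the char before it is different (we can use \1 instead of the . here, and it will work the same)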
\1 - same char as in Group 1
(?!\1) - a negative lookahead that fails the match if there is the same char as in Group 1 immediately to the right of the current location.
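To see the two lookarounds at work, here is a small sketch that runs the pattern on a few short strings I made up (they are not from the question), together with the \1\1 variant of the lookbehind mentioned above. The names PAIR, PAIR_ALT and samples are just for this illustration.

```python
import re

# The pattern from this answer, and the variant with \1 in place of the dot.
PAIR = re.compile(r'(.)(?<!\1.)\1(?!\1)', re.DOTALL)
PAIR_ALT = re.compile(r'(.)(?<!\1\1)\1(?!\1)', re.DOTALL)

# Made-up inputs: pairs at the start/end, runs of 3 and 4, and mixed cases.
samples = ['aa', 'aab', 'baa', 'aaa', 'aaaa', 'aabbb', 'bbbaa']

results = {s: [m.group() for m in PAIR.finditer(s)] for s in samples}
results_alt = {s: [m.group() for m in PAIR_ALT.finditer(s)] for s in samples}

for s in samples:
    print(f'{s!r}: {results[s]} (variant: {results_alt[s]})')
```

In 'aaa' the attempt starting at the first 'a' is rejected by the lookahead, and the attempts starting at the second and third 'a' are rejected by the lookbehind, so nothing is left to match.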

See the Python test:

import re
tests = {
    'ffumlmqwfcsyqpss': ['ff', 'ss'],
    'zztdcqzqddaazdjp': ['zz', 'dd', 'aa'],
    'urrvucyrzzzooxhx': ['rr', 'oo'],
    'zettygjpcoedwyio': ['tt'],
    'dtfkgggvqadhqbwb': [],
    'rwgwbwzebsnjmtln': [],
}


for test, answer in tests.items():
    matches = [m.group() for m in re.finditer(r'(.)(?<!\1.)\1(?!\1)', test, re.DOTALL)]
    if matches:
        print(f"Matches found in '{test}': {matches}. Is the answer expected? {set(matches)==set(answer)}.")
    else:
        print(f"No match found in '{test}'. Is the answer expected? {set(matches)==set(answer)}.")

Output:

Matches found in 'ffumlmqwfcsyqpss': ['ff', 'ss']. Is the answer expected? True.
Matches found in 'zztdcqzqddaazdjp': ['zz', 'dd', 'aa']. Is the answer expected? True.
Matches found in 'urrvucyrzzzooxhx': ['rr', 'oo']. Is the answer expected? True.
Matches found in 'zettygjpcoedwyio': ['tt']. Is the answer expected? True.
No match found in 'dtfkgggvqadhqbwb'. Is the answer expected? True.
No match found in 'rwgwbwzebsnjmtln'. Is the answer expected? True.
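If all you need is a yes/no answer per string, as the question's "should match" wording suggests, re.search is enough. A minimal sketch, where has_isolated_pair and ISOLATED_PAIR are hypothetical names chosen for this example:

```python
import re

# Hypothetical helper name; the pattern is the one from this answer.
ISOLATED_PAIR = re.compile(r'(.)(?<!\1.)\1(?!\1)', re.DOTALL)


def has_isolated_pair(text):
    """Return True if text contains a char repeated exactly twice in a row."""
    return ISOLATED_PAIR.search(text) is not None


# The six example strings from the question.
examples = [
    'ffumlmqwfcsyqpss',
    'zztdcqzqddaazdjp',
    'urrvucyrzzzooxhx',
    'zettygjpcoedwyio',
    'dtfkgggvqadhqbwb',
    'rwgwbwzebsnjmtln',
]

verdicts = {s: has_isolated_pair(s) for s in examples}
for s, ok in verdicts.items():
    print(s, ok)
```

The pattern is compiled once so it can be reused across many strings; if you only want lowercase letters, swapping the . for [a-z] should behave the same on these inputs.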

Upvotes: 1

Tim Biegeleisen

Reputation: 521419

You may use the following regex pattern:

^(?![a-z]*([a-z])\1{2,})[a-z]*([a-z])\2[a-z]*$

Demo

This pattern says to match:

^                        start of the string
(?![a-z]*([a-z])\1{2,})  assert that no letter occurs 3 or more times in a row anywhere
[a-z]*                   zero or more letters
([a-z])                  capture a letter
\2                       which is followed by the same letter
[a-z]*                   zero or more letters
$                        end of the string
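As a rough sketch of how this whole-string pattern behaves, here it is run against the six example strings from the question. The names WHOLE_STRING, examples and accepted are only for this illustration.

```python
import re

# The pattern from this answer: it validates the entire string at once.
WHOLE_STRING = re.compile(r'^(?![a-z]*([a-z])\1{2,})[a-z]*([a-z])\2[a-z]*$')

# The six example strings from the question.
examples = [
    'ffumlmqwfcsyqpss',
    'zztdcqzqddaazdjp',
    'urrvucyrzzzooxhx',
    'zettygjpcoedwyio',
    'dtfkgggvqadhqbwb',
    'rwgwbwzebsnjmtln',
]

accepted = {s: WHOLE_STRING.search(s) is not None for s in examples}
for s, ok in accepted.items():
    print(s, ok)
```

Because the negative lookahead rejects any string containing a run of three or more, 'urrvucyrzzzooxhx' is not accepted even though it contains 'rr' and 'oo'. Keep that in mind if such strings should still count as a match.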

Upvotes: 0
