Reputation: 19
My original question was closed for being a duplicate. I disagree with it being a duplicate as this is a different use case looking at regular expression syntax. I have tried to clarify my question below.
Is it possible to create a regular expression which matches two duplicate consecutive characters within a string (in this example lowercase letters) but does not match a section of the string if the same characters are either side. e.g. match 'aa'
but not 'aaa'
or 'aaaa'
?
Additionally:
Although I am using Python 3.10 I am trying to work out if this is possible using 'standard' regular expression syntax without utilising additional functionality provided by external modules. For example using Python this would mean a solution which uses the 're'
module from the standard library.
If there are 3 or more duplicate consecutive characters, the string should still match if there are two duplicate consecutive characters elsewhere in the sting. e.g match 'aa'
even if 'bbb'
exists elsewhere in the string.
The string should also match if the two duplicate consecutive characters appear at the beginning or end of the string.
My examples are 16 character strings if a specific length makes a difference.
ffumlmqwfcsyqpss
should match either 'ff'
or 'ss'
.
zztdcqzqddaazdjp
should match either 'zz'
,'dd'
, 'aa'
.
urrvucyrzzzooxhx
should match 'rr'
or 'oo'
even though 'zzz'
exists in the string.
zettygjpcoedwyio
should match 'tt'
.
dtfkgggvqadhqbwb
should not match 'ggg'
.
rwgwbwzebsnjmtln
should not match.
([a-z])\1
to capture the duplicate character but this also matches when there are additional duplicate characters such as 'aaa'
or 'aaaa'
etc.
([a-z])\1(?!\1)
to negate the third duplicate character but this just moves the match to the end of the duplicate character string.
Negative lookarounds to compensate for a match at the beginning but I think I am causing some kind of loop which will never match.
>>>import re
>>>re.search(r'([a-z])\1(?!\1)', 'dtfkgggvqadhqbwb')
<re.Match object; span=(5, 7), match='gg'> # should not match as 'gg' ('[gg]g' or 'g[gg]')
Wiktor Stribiżew's solution uses the additional (*SKIP)
functionality of the external python regex module.
Tim Biegeleisen's solution does not match duplicate pairs if there are duplicate triples etc in the same string.
In the linked question, Cary Swoveland's solutions do not work for duplicate pairs at the beginning or end of a string or match even when there is no duplicate in the string.
In the linked question, the fourth bird's solution does not match duplicate pairs at the beginning or end of strings.
So far the only answer which works is Wiktor Stribiżew's but this uses the (*SKIP)
function of the external 'regex' module. Is a solution not possible using 'standard' regular expression syntax?
Upvotes: 1
Views: 2032
Reputation: 626903
In Python re
, the main problem with creating the right regex for this task is the fact that you need to define the capturing group before using a backreference to the group, and negative lookbehinds are usually placed before the captured pattern. Also, regex101.com Python testing option is not always reflecting the current state of affairs in the re
library, and it confuses users with the message like "This token can not be used in a lookbehind due to either making it non-fixed width or interfering with the pattern matching" when it sees a \1
in (?<!\1)
, while Python allows this since v3.5 for groups of fixed length.
The pattern you can use here is
(.)(?<!\1.)\1(?!\1)
See the regex demo.
Details
(.)
- Capturing group 1: any single char (if re.DOTALL
is used, even line break chars)(?<!\1.)
- a negative lookbehind that fails the match if there is the same char as captured in Group 1 and then any single char (we can use \1
instead of the .
here, and it will work the same) immediately to the left of the current location\1
- same char as in Group 1(?!\1)
- a negative lookahead that fails the match if there is the same char as in Group 1 immediately to the right of the current location.See the Python test:
import re
tests ={'ffumlmqwfcsyqpss': ['ff','ss'],
'zztdcqzqddaazdjp': ['zz','dd', 'aa'],
'urrvucyrzzzooxhx': ['rr','oo'],
'zettygjpcoedwyio': ['tt'],
'dtfkgggvqadhqbwb': [],
'rwgwbwzebsnjmtln': []
}
for test, answer in tests.items():
matches = [m.group() for m in re.finditer(r'(.)(?<!\1.)\1(?!\1)', test, re.DOTALL)]
if matches:
print(f"Matches found in '{test}': {matches}. Is the answer expected? {set(matches)==set(answer)}.")
else:
print(f"No match found in '{test}'. Is the answer expected? {set(matches)==set(answer)}.")
Output:
Matches found in 'ffumlmqwfcsyqpss': ['ff', 'ss']. Is the answer expected? True.
Matches found in 'zztdcqzqddaazdjp': ['zz', 'dd', 'aa']. Is the answer expected? True.
Matches found in 'urrvucyrzzzooxhx': ['rr', 'oo']. Is the answer expected? True.
Matches found in 'zettygjpcoedwyio': ['tt']. Is the answer expected? True.
No match found in 'dtfkgggvqadhqbwb'. Is the answer expected? True.
No match found in 'rwgwbwzebsnjmtln'. Is the answer expected? True.
Upvotes: 1
Reputation: 521419
You may use the following regex pattern:
^(?![a-z]*([a-z])\1{2,})[a-z]*([a-z])\2[a-z]*$
This pattern says to match:
^ start of the string
(?![a-z]*([a-z])\1{2,}) same letter does not occur 3 times or more
[a-z]* zero or more letters
([a-z]) capture a letter
\2 which is followed by the same letter
[a-z]* zero or more letters
$ end of the string
Upvotes: 0