Reputation: 21
I've been trying for about two hours now to write a regular expression which matches a single character that's not preceded or followed by the same character.
This is what I've got: (\d)(?<!\1)\1(?!\1); but it doesn't seem to work! (testing at https://regex101.com/r/whnj5M/6)
For example:
In 1111223 I would expect to match the 3 at the end, since it's not preceded or followed by another 3.
In 1151223 I would expect to match the 5 in the middle, and the 3 at the end for the same reasons as above.
The end goal for this is to be able to find pairs (and only pairs) of characters in strings (e.g. to find 11 in 112223 or 44 in 123544), and I was going to try to match single isolated characters and then add a {2} to it to find pairs, but I can't even seem to get isolated characters to match!
Any help would be much appreciated, I thought I knew RegEx pretty well!
P.S. I'm testing in JS on regex101.com because it wouldn't let me use variable-length lookbehinds in Python on there, and I'm using the regex library to allow for this in my actual implementation.
Upvotes: 2
Views: 87
Reputation: 110675
@Theforthbird has provided a good explanation for why your regular expression does not match the characters of interest.
Each character matched by the following regular expression is neither preceded nor followed by the same character (including characters at the beginning and end of the string).
r'^.$|^(.)(?!\1)|(?<=(.))(?!\2)(.)(?!\3)'
Python's re regex engine performs the following operations.
^.$        match the character if it is the only character in the line
|          or
^          match the beginning of the line
(.)        match a character and save it to capture group 1...
(?!\1)     ...that is not followed by the same character
|          or
(?<=(.))   save the previous character in capture group 2...
(?!\2)     ...that is not equal to the next character
(.)        match a character and save it to capture group 3...
(?!\3)     ...that is not equal to the following character
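As a quick check, the pattern can be run with Python's built-in re module against the two strings from the question. This is only a minimal sketch: the helper name isolated_chars is made up for illustration, and finditer is used rather than findall so that the whole match is returned instead of a tuple of capture groups.

```python
import re

# Pattern from this answer: a character that is neither preceded
# nor followed by a copy of itself.
ISOLATED = re.compile(r'^.$|^(.)(?!\1)|(?<=(.))(?!\2)(.)(?!\3)')


def isolated_chars(text):
    """Return every isolated character in text, in order (hypothetical helper)."""
    return [m.group() for m in ISOLATED.finditer(text)]


for sample in ("1111223", "1151223"):
    print(sample, isolated_chars(sample))
```

For 1151223 the third 1 is reported as well, because it has no 1 on either side.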
Suppose the string were "cat".
"c" is not at the end of the line, so the first part of the alternation fails and the second part is considered. "c" is matched and saved to capture group 1. (?!\1), which asserts that "c" is not followed by the content of capture group 1, succeeds, so "c" is matched and the internal string pointer is advanced to a position between "c" and "a".
"a" fails the first two parts of the alternation, so the third part is considered. (?<=(.)) saves the preceding character ("c") in capture group 2. (?!\2), which asserts that the next character ("a") is not equal to the content of capture group 2, succeeds. The string pointer remains just before "a". The next character ("a") is matched and saved in capture group 3. (?!\3), which asserts that the following character ("t") does not equal the content of capture group 3, succeeds, so "a" is matched and the string pointer advances to just before "t".
The same steps are performed when evaluating "t" as were performed when evaluating "a". Here the last token ((?!\3)) succeeds, however, because no characters follow "t".
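The walk-through for "cat" can be reproduced by listing which capture groups are set for each match. A small sketch using the same pattern with the standard re module; the function name trace_groups is an assumption made for illustration.

```python
import re

PATTERN = re.compile(r'^.$|^(.)(?!\1)|(?<=(.))(?!\2)(.)(?!\3)')


def trace_groups(text):
    """Return (matched_text, groups) per match; groups that took no part are None."""
    return [(m.group(), m.groups()) for m in PATTERN.finditer(text)]


for matched, groups in trace_groups("cat"):
    # groups is (group 1, group 2, group 3)
    print(matched, groups)
```

Only "c" is matched by the second alternative (group 1); "a" and "t" go through the third, with the preceding character in group 2 and the matched character in group 3.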
Upvotes: 2
Reputation: 163237
The pattern you tried does not match because this part (\d)(?<!\1) cannot match.
It reads as:
Capture a digit in group 1. Then, on the position after that captured digit, assert what is captured should not be on the left.
You could make the pattern work by adding, for example, a dot after the backreference, (?<!\1.), to assert that the character before the one you have just matched is not the same as group 1.
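The two lookbehinds can be compared side by side. This is a minimal sketch with the standard re module, assuming Python 3.5 or later, where a backreference to a fixed-length group is accepted inside a lookbehind; the variable names are illustrative only.

```python
import re

text = "1151223"

# Right after (\d) has consumed a digit, the character immediately to the
# left IS that digit, so the negative lookbehind (?<!\1) always fails.
always_fails = re.search(r"(\d)(?<!\1)", text)

# With the extra dot, the lookbehind steps back over the digit just matched
# and inspects the character before it instead.
with_dot = re.search(r"(\d)(?<!\1.)", text)

print(always_fails)      # no match object at all
print(with_dot.group())  # first digit not preceded by a copy of itself
```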
Pattern
(\d)(?<!\1.)\1(?!\1)
Note that you have selected ECMAscript on regex101.
Python's re module does not support variable-width lookbehind, so the example below uses the PyPI regex module, which does.
Example code
import regex  # third-party module from PyPI (pip install regex)

pattern = r"(\d)(?<!\1.)\1(?!\1)"
test_str = ("1111223\n"
            "1151223\n\n"
            "112223\n"
            "123544")

# Print every pair of digits that is not part of a longer run.
for match in regex.finditer(pattern, test_str):
    print(match.group())
Output
22
11
22
11
44
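As a cross-check that does not rely on any regex engine, the same pairs can be found by grouping runs of equal characters and keeping only the runs of length two. This swaps itertools.groupby in for the regex and is only a sketch; the function name exact_pairs is an assumption.

```python
from itertools import groupby


def exact_pairs(text):
    """Return runs of exactly two identical digits (hypothetical helper)."""
    pairs = []
    for char, run in groupby(text):
        run_length = sum(1 for _ in run)
        # Skip non-digits so the blank line ("\n\n") is not reported as a pair.
        if char.isdigit() and run_length == 2:
            pairs.append(char * 2)
    return pairs


test_str = "1111223\n1151223\n\n112223\n123544"
print(exact_pairs(test_str))
```

On the test string this yields the same five pairs, in the same order, as the output listed above.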
Upvotes: 2
Reputation: 147146
Your regex is close, but by using simply (\d) you are consuming characters, which prevents the other match from occurring. Instead, you can use a positive lookahead to set the capture group and then test for any occurrences of the captured digit not being surrounded by copies of itself:
(?=.*?(.))(?<!\1)\1(?!\1)
By using a lookahead you avoid consuming any characters and so the regex can match anywhere in the string.
Note that in 1151223 this returns 5, 1 and 3 because the third 1 is not adjacent to any other 1s.
Demo on regex101 (requires JS that supports variable width lookbehinds)
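The same idea can also be tried in Python. This is a minimal sketch with the standard re module, assuming Python 3.5 or later, where a backreference to a fixed-length group is accepted inside a lookbehind; the function name lone_chars is made up for illustration.

```python
import re

# The lookahead sets group 1 without consuming anything; the lookbehind
# and the trailing lookahead then reject neighbouring copies of it.
LONE = re.compile(r"(?=.*?(.))(?<!\1)\1(?!\1)")


def lone_chars(text):
    """Return characters with no identical neighbour (hypothetical helper)."""
    return [m.group() for m in LONE.finditer(text)]


print(lone_chars("1151223"))
print(lone_chars("1111223"))
```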
Upvotes: 2