LKloosterman

Reputation: 21

Why doesn't this RegEx match anything?

I've been trying for about two hours now to write a regular expression which matches a single character that's not preceded or followed by the same character.

This is what I've got: (\d)(?<!\1)\1(?!\1); but it doesn't seem to work! (testing at https://regex101.com/r/whnj5M/6)

For example:

In 1111223 I would expect to match the 3 at the end, since it's not preceded or followed by another 3.

In 1151223 I would expect to match the 5 in the middle, and the 3 at the end for the same reasons as above.

The end goal for this is to be able to find pairs (and only pairs) of characters in strings (e.g. to find 11 in 112223 or 44 in 123544) and I was going to try and match single isolated characters, and then add a {2} to it to find pairs, but I can't even seem to get isolated characters to match!

Any help would be much appreciated, I thought I knew RegEx pretty well!

P.S. I'm testing in JS on regex101.com because it wouldn't let me use variable-length lookbehinds in Python on there, and I'm using the regex library to allow for this in my actual implementation.

Upvotes: 2

Views: 87

Answers (3)

Cary Swoveland

Reputation: 110675

@Thefourthbird has provided a good explanation of why your regular expression does not match the characters of interest.

Each character matched by the following regular expression is neither preceded nor followed by the same character (including characters at the beginning and end of the string).

r'^.$|^(.)(?!\1)|(?<=(.))(?!\2)(.)(?!\3)'


Python's re regex engine performs the following operations.

^.$          match the first char if it is the only char in the line
|            or 
^            match beginning of line
(.)          match a char in capture group 1...
(?!\1)       ...that is not followed by the same character 
|            or
(?<=(.))     save the previous char in capture group 2... 
(?!\2)       ...that is not equal to the next char
(.)          match a character and save to capture group 3...
(?!\3)       ...that is not equal to the following char

Suppose the string were "cat".

  • The internal string pointer is initially at the beginning of the line.
  • "c" is not at the end of the line so the first part of the alternation fails and the second part is considered.
  • "c" is matched and saved to capture group 1.
  • The negative lookahead asserting that "c" is not followed by the content of capture group 1 succeeds, so "c" is matched and the internal string pointer is advanced to a position between "c" and "a".
  • "a" fails the first two parts of the alternation so the third part is considered.
  • The positive lookbehind (?<=(.)) saves the preceding character ("c") in capture group 2.
  • The negative lookahead (?!\2), which asserts that the next character ("a") is not equal to the content of capture group 2, succeeds. The string pointer remains just before "a".
  • The next character ("a") is matched and saved in capture group 3.
  • The negative lookahead (?!\3), which asserts that the following character ("t") does not equal the content of capture group 3, succeeds, so "a" is matched and the string pointer advances to just before "t".
  • The same steps are performed when evaluating "t" as were performed when evaluating "a". Here the last token ((?!\3)) succeeds because no characters follow "t".
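The walk-through above can be checked with a short script. This is only a sketch: the pattern is copied from this answer, while the helper name isolated_chars and the sample strings are chosen here for illustration.

```python
import re

# Pattern copied verbatim from the answer above.
ISOLATED = re.compile(r'^.$|^(.)(?!\1)|(?<=(.))(?!\2)(.)(?!\3)')

def isolated_chars(text):
    """Return every character that has no identical neighbour."""
    # group() with no argument is the whole match, which is always
    # a single character for this pattern.
    return [m.group() for m in ISOLATED.finditer(text)]

print(isolated_chars("cat"))      # ['c', 'a', 't']
print(isolated_chars("1111223"))  # ['3']
print(isolated_chars("1151223"))  # ['5', '1', '3']
```

Note that the third 1 in 1151223 is reported as well, since it is not adjacent to another 1.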

Upvotes: 2

The fourth bird

Reputation: 163237

The pattern you tried does not match because this part, (\d)(?<!\1), can never match.

It reads as:

Capture a digit in group 1. Then, at the position just after that captured digit, assert that what was captured is not directly to the left. The text directly to the left is the captured digit itself, so the assertion always fails.

You could make the pattern work by adding, for example, a dot after the backreference, (?<!\1.), so that the lookbehind asserts that the character before the one you have just matched is not the same as group 1.

Pattern

 (\d)(?<!\1.)\1(?!\1)
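To make the position arithmetic concrete, here is a rough plain-Python imitation of the two lookbehinds. It is not how a regex engine is implemented, and the function names are invented for this illustration. It only shows which character each assertion ends up comparing against.

```python
def lookbehind_without_dot(s, i):
    # Imitates (\d)(?<!\1): the digit s[i] has been consumed,
    # so the cursor now sits at i + 1.
    captured = s[i]
    cursor = i + 1
    # One character to the left of the cursor is the captured digit itself.
    return s[cursor - 1] != captured  # always False


def lookbehind_with_dot(s, i):
    # Imitates (\d)(?<!\1.): the dot covers the digit just consumed,
    # so \1 is compared with the character before it, s[i - 1].
    captured = s[i]
    cursor = i + 1
    if cursor - 2 < 0:
        return True  # nothing that far left, so the negative lookbehind succeeds
    return s[cursor - 2] != captured


s = "1151223"
print([lookbehind_without_dot(s, i) for i in range(len(s))])  # all False
print([lookbehind_with_dot(s, i) for i in range(len(s))])
```

The first list contains only False, which is why the original pattern never matches. The second is False only where a digit repeats its left neighbour.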


Note that you have selected ECMAScript on regex101.

Python re does not support variable width lookbehind.

To make this work in Python, you need the PyPI regex module.

Example code

import regex

pattern = r"(\d)(?<!\1.)\1(?!\1)"

test_str = ("1111223\n"
    "1151223\n\n"
    "112223\n"
    "123544")

# finditer yields one match object per non-overlapping match
for match in regex.finditer(pattern, test_str):
    print(match.group())

Output

22
11
22
11
44
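As a cross-check that needs only the standard library, the same pairs can be found without any regex by grouping consecutive equal characters with itertools.groupby. This is a different technique from the pattern above, and the helper name exact_pairs is made up for this sketch. Unlike \d, it accepts any character, not only digits.

```python
from itertools import groupby


def exact_pairs(text):
    """Return the runs of exactly two identical characters, in order."""
    pairs = []
    for char, run in groupby(text):
        # groupby yields each maximal run of consecutive equal characters
        if len(list(run)) == 2:
            pairs.append(char * 2)
    return pairs


for line in ["1111223", "1151223", "112223", "123544"]:
    print(line, exact_pairs(line))
```

For the four test strings this yields 22, then 11 and 22, then 11, then 44, which agrees with the output listed above.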

Upvotes: 2

Nick

Reputation: 147146

Your regex is close, but a plain (\d) consumes the digit before the lookbehind and \1 are tried, so they can never apply to that same digit. Instead, you can use a positive lookahead to set the capture group and then test for any occurrences of the captured digit not being surrounded by copies of itself:

(?=.*?(.))(?<!\1)\1(?!\1)

By using a lookahead you avoid consuming any characters and so the regex can match anywhere in the string.

Note that in 1151223 this returns 5, 1 and 3 because the third 1 is not adjacent to any other 1s.

Demo on regex101 (requires JS that supports variable width lookbehinds)

Upvotes: 2
