Wynell
Wynell

Reputation: 795

Longest substring that doesn't match a REGEX pattern

I'm preparing for an exam, and here's the task: Given a string which consists of uppercase letters of English alphabet, find the longest substring of it, which doesn't contain QW or WQ.

I know I could do re.split or something like that, but I made it a challenge for me to do it with a "regex matching" expression like len(max(re.findall(...), key=len)) without using split or other methods. Is it even possible?

To find all matching subtrings, I tried this:

list(map(lambda x: x[0], re.findall(r'(((?<!QW|WQ).)+(?!QW|WQ))', text))

But this does match a substring which ends with WQ, for example. How do I fix this?

Let me also clarify something. Let's say the string is WQABCDEFGHQW. The answer in this case is QABCDEFGHQ, as this is exactly the longest substring that doesn't contain WQ or QW.

Upvotes: 1

Views: 379

Answers (4)

Kelly Bundy
Kelly Bundy

Reputation: 27609

.?(?:.(?<!QW|WQ))*

Any single character is ok. Any further character is ok unless it creates QW or WQ.

Demo

Python demo along with two split-solutions:

input:   'AQWBQWC':
find:   ['AQ', 'WBQ', 'WC', '']
split1: ['AQ', 'WBQ', 'WC']
split2: ['AQ', 'WBQ', 'WC']

input:   'AQWBWQC':
find:   ['AQ', 'WBW', 'QC', '']
split1: ['AQ', 'WBW', 'QC']
split2: ['AQ', 'WBW', 'QC']

input:   '':
find:   ['']
split1: ['']
split2: ['']

input:   'A':
find:   ['A', '']
split1: ['A']
split2: ['A']

input:   'Q':
find:   ['Q', '']
split1: ['Q']
split2: ['Q']

input:   'QW':
find:   ['Q', 'W', '']
split1: ['Q', 'W']
split2: ['Q', 'W']

(The extra empty matches don't matter, just like other non-longest matches don't matter. Will get discarded by your max(..., key=len).)

Code (Try it online!):

import re

find  = r'.?(?:.(?<!QW|WQ))*'
split1 = r'(?<=Q)(?=W)|(?<=W)(?=Q)'
split2 = r'(?=.(?<=QW|WQ))'

for s in 'AQWBQWC', 'AQWBWQC', '', 'A', 'Q', 'QW':
    print('input:  ', repr(s) + ':')
    print('find:  ', re.findall(find, s))
    print('split1:', re.split(split1, s))
    print('split2:', re.split(split2, s))
    print()

Upvotes: 2

Casimir et Hippolyte
Casimir et Hippolyte

Reputation: 89557

You could use that: (?:(?!WQ|QW).(?<!WQ|QW))+

At each position (of the dot) it tests forward and backward if the character isn't a part of WQ or QW.

The key is to test forward before the dot and backward after the dot position.

demo


Other possible pattern: (?:.(?!.?(?<=WQ|QW)))+

This time a lookbehind is inside a negative lookahead but since it is preceded by an optional character .?, it checks the two possible positions of the WQ/QW sequence.

demo


Or to reduce the number of steps an unrolled pattern:
[^WQ]+(?:[QW](?!.?(?<=QW|WQ))[^WQ]*)*|(?:[QW](?!.?(?<=QW|WQ))[^WQ]*)+

But it's a bit long (it's unrolled).

demo

Upvotes: 1

Artyom Vancyan
Artyom Vancyan

Reputation: 5390

You could use

(?:(?!Q(?!W)|(?<!W)Q).)+

Python Example

import re

s = "HAKSDUWQHUPHPSAHFPUAHPUSNFJHJWQHPJQWPHJWQWASDIAS"
print(max(re.findall(r"(?:(?!Q(?!W)|(?<!W)Q).)+", s), key=len))  # 'HUPHPSAHFPUAHPUSNFJHJW'

See the regex demo

Upvotes: 0

Wiktor Stribiżew
Wiktor Stribiżew

Reputation: 626870

You can use

(?:[A-Z](?<!QW|WQ)(?<!Q(?=W)|W(?=Q)))+

See the regex demo.

Details:

  • (?: - start of a non-capturing group:
    • [A-Z] - an uppercase ASCII letter
    • (?<!QW|WQ) - a negative lookbehind that fails the match if, immediately on the left, there is a QW or WQ substring (i.e. if the [A-Z] matched W (preceded with Q) or Q (preceded with W))
    • (?<!Q(?=W)|W(?=Q)) - a negative lookbehind that fails the match if, immediately on the left, there is a Q immediately followed with W, or W that is immediately followed with Q (i.e. if [A-Z] matched Q and the next char is W, or if [A-Z] matches W and the next char is Q)
  • )+ - end of the group, one or more occurrences.

Another approach:

(?:(?!QW|WQ)[A-Z](?<!QW|WQ))+

See this regex demo. Details:

  • (?: - start of a non-capturing group:
    • (?!QW|WQ) - a negative lookahead that fails the match if, immediately on the right, there is a QW or WQ substring
    • [A-Z] - an uppercase ASCII letter
    • (?<!QW|WQ) - a negative lookbehind that fails the match if, immediately on the left, there is a QW or WQ substring
  • )+ - end of the group, one or more occurrences.

Upvotes: 2

Related Questions