Dirty Penguin

Reputation: 4402

Regex: Correctly matching groups with negative lookbehind

I'm working with this string:

qr/I Love Chocolate|And Free Shipping|All (day|night)|please/i;

I'm using the following regex pattern:

(?:qr\/)?(.*?)(?:\||\/)

I'd like to get the following matches back:

["I Love Chocolate", "And Free Shipping", "All (day|night)", "please"]

However, this is what I actually get back:

["I Love Chocolate", "And Free Shipping", "All (day", "night)", "please"]

I modified my regex to use a negative lookbehind:

(?:qr\/)?(?<!All \(day|night\))(.*?)(?:\||\/)

However, this still splits the string into All (day and night).

How do I adjust the regex so that All (day|night) is captured as a single string, rather than being split into All (day and night)?

More generally, the goal here in muggle-speak would be: "Find any groups delimited by the pipe character, unless the group contains one or more pipe characters enclosed in parentheses; in that case, treat the entire string as one group."

Upvotes: 3

Views: 128

Answers (2)

anubhava

Reputation: 785276

You can use this regex for matching:

[^/|(]+(?:\([^)]*\))*

Code:

>>> import re
>>> s = 'qr/I Love Chocolate|And Free Shipping|All (day|night)|please/i'
>>> print(re.findall(r'[^/|(]+(?:\([^)]*\))*', s))
['qr', 'I Love Chocolate', 'And Free Shipping', 'All (day|night)', 'please', 'i']

Or, if you want to discard the qr/ at the start and the /i at the end, use:

>>> print(re.findall(r'[^/|(]+(?:\([^)]*\))*', re.sub(r'^qr/(.*)/i$', r'\1', s)))
['I Love Chocolate', 'And Free Shipping', 'All (day|night)', 'please']
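
To make the pattern easier to read, here is the same regex spelled out with re.VERBOSE. This is only a sketch for illustration: the names PATTERN, inner and parts are not from the answer above, and it assumes the parentheses are balanced and not nested.

```python
import re

# Same regex as above, written out piece by piece (PATTERN is an illustrative name).
PATTERN = re.compile(r"""
    [^/|(]+          # a run of characters that are not a slash, pipe or open paren
    (?:              # optionally followed by ...
        \( [^)]* \)  # ... a parenthesised chunk, including any pipes inside it
    )*               # ... repeated zero or more times
""", re.VERBOSE)

s = 'qr/I Love Chocolate|And Free Shipping|All (day|night)|please/i'

# Strip the leading qr/ and trailing /i first, then collect the groups.
inner = re.sub(r'^qr/(.*)/i$', r'\1', s)
parts = PATTERN.findall(inner)
print(parts)
```

One limitation worth knowing: text that follows a closing parenthesis starts a new match, so an input like All (day|night) long would come back as two pieces.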


Upvotes: 3

alecxe

Reputation: 473893

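If this is only about the words day and night on either side of the |, you can split on every | that is not preceded by day and not followed by night, using a negative lookbehind and a negative lookahead:

>>> import re
>>> s = "qr/I Love Chocolate|And Free Shipping|All (day|night)|please/i;"
>>> re.split(r"(?<!day)\|(?!night)", s)
['qr/I Love Chocolate', 'And Free Shipping', 'All (day|night)', 'please/i;']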

I'd also remove the qr/ prefix and the /i; suffix beforehand to keep the split expression simple. For example:

>>> s = "qr/I Love Chocolate|And Free Shipping|All (day|night)|please/i;"
>>> s = re.sub(r"^[a-z]+/", "", s)
>>> s = re.sub(r"/[a-z]+;$", "", s)

Then, split:

>>> re.split(r"(?<!day)\|(?!night)", s)
['I Love Chocolate', 'And Free Shipping', 'All (day|night)', 'please']
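
If the words inside the parentheses are not known in advance, the same idea can be generalised with a single lookahead that checks whether the pipe sits inside a pair of parentheses. The following is a minimal sketch, not a definitive solution: the helper name split_outside_parens is made up for this example, and it assumes the parentheses are balanced and not nested.

```python
import re

def split_outside_parens(text):
    # Split on a pipe unless a closing paren follows before any other
    # parenthesis, i.e. unless the pipe is inside a (non-nested) pair.
    return re.split(r"\|(?![^()]*\))", text)

s = "qr/I Love Chocolate|And Free Shipping|All (day|night)|please/i;"
s = re.sub(r"^[a-z]+/", "", s)   # drop the qr/ prefix
s = re.sub(r"/[a-z]+;$", "", s)  # drop the /i; suffix
result = split_outside_parens(s)
print(result)
```

Unlike the day/night version, this does not depend on the specific words, so something like All (day|night|dusk) long|x would also keep the parenthesised part together.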

Upvotes: 2
