regex with optional block of text

Question

I'm using a regex to parse structured text like the sample below, with carets marking what I'm trying to match:

block 1
^^^^^^^
    subblock 1.1
        attrib a=a1
    subblock 1.2
        attrib b=b1
                 ^^
block 2
    subblock 2.1
        attrib a=a2
block 3
^^^^^^^
    subblock 3.1
        attrib a=a3
    subblock 3.2
        attrib b=b3
                 ^^

A subblock may or may not appear inside a block; for example, block 2 has no subblock 2.2.

The expected match is [(block1,b1), (block3,b3)].

/(capture block#)[\s\S]*?attrib\sb=(capture b#)/gm

But this ends up matching [(block1, b1), (block2, b3)].

Where is my regex going wrong?

Wiktor Stribiżew · Accepted Answer

You can use

(?m)(^block\s*\d+).*(?:\n(?!block\s*\d).*)*\battrib\s*b=(\w+)

See the regex demo

The regex is based on the unroll-the-loop technique. Here is an explanation:

(?m) - multiline modifier to make ^ match the beginning of a line
(^block\s*\d+) - match and capture the word block, optional whitespace, and 1+ digits (Group 1)
.* - match the rest of the line (the DOTALL option must be off, so . stops at the newline)
(?:\n(?!block\s*\d).*)* - match zero or more following lines, each consumed only if it does not start with the word block followed by optional whitespace and a digit (this sets the block boundary)
\battrib\s*b=(\w+) - match the whole word attrib followed by 0+ whitespace characters and a literal b=, then capture 1+ alphanumeric or underscore characters with (\w+) (note: adjust this to fit your real data)
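The breakdown above can also be written into the pattern itself. The following is only a sketch, assuming Python's re.VERBOSE flag is acceptable in your code; the short sample string is made up for illustration and is not the full input from the question.

```python
import re

# Same pattern as in the answer, spread out with re.VERBOSE so each part
# can carry its own comment. re.MULTILINE replaces the inline (?m).
pattern = re.compile(r"""
    (^block\s*\d+)        # Group 1: 'block', optional whitespace, 1+ digits
    .*                    # rest of the block header line
    (?:                   # then zero or more following lines...
        \n(?!block\s*\d)  #   ...each taken only if it does not start a new block
        .*                #   consume that line
    )*
    \battrib\s*b=         # backtrack to an 'attrib b=' inside this block
    (\w+)                 # Group 2: the value assigned to b
""", re.MULTILINE | re.VERBOSE)

# Hypothetical two-block sample: only block 1 contains 'attrib b='.
sample = (
    "block 1\n"
    "    subblock 1.1\n"
    "        attrib b=b1\n"
    "block 2\n"
    "    subblock 2.1\n"
    "        attrib a=a2\n"
)

matches = pattern.findall(sample)
print(matches)  # [('block 1', 'b1')]
```

Block 2 yields no match because the repeated group cannot cross into another block, so the search for attrib b= fails inside block 2 instead of running on into later text.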

Python demo:

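import re

# (?m) makes ^ match at the start of every line; \n is written as an escape
# so the whole pattern stays on one line.
p = re.compile(r'(?m)(^block\s*\d+).*(?:\n(?!block\s*\d).*)*\battrib\s*b=(\w+)')

# Triple quotes are needed for a string literal that spans several lines.
s = """block 1
    subblock 1.1
        attrib a=a1
    subblock 1.2
        attrib b=b1
block 2
    subblock 2.1
        attrib a=a2
block 3
    subblock 3.1
        attrib a=a3
    subblock 3.2
        attrib b=b3"""
print(p.findall(s))
# [('block 1', 'b1'), ('block 3', 'b3')]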
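To see why the lazy attempt from the question overshoots, it can help to run both patterns side by side. This is a sketch only: the lazy pattern below is an assumed concrete form of the question's placeholder regex (the capture groups are filled in with block\s*\d+ and \w+), not the asker's exact expression.

```python
import re

text = """block 1
    subblock 1.1
        attrib a=a1
    subblock 1.2
        attrib b=b1
block 2
    subblock 2.1
        attrib a=a2
block 3
    subblock 3.1
        attrib a=a3
    subblock 3.2
        attrib b=b3"""

# Assumed form of the question's pattern: [\s\S]*? is lazy, but nothing
# stops it from crossing into the next block while looking for 'attrib b='.
lazy = re.compile(r'(?m)^(block\s*\d+)[\s\S]*?attrib\sb=(\w+)')

# Pattern from the answer: the negative lookahead refuses any line that
# starts a new block, so a match can never span two blocks.
unrolled = re.compile(r'(?m)(^block\s*\d+).*(?:\n(?!block\s*\d).*)*\battrib\s*b=(\w+)')

lazy_matches = lazy.findall(text)
unrolled_matches = unrolled.findall(text)
print(lazy_matches)      # [('block 1', 'b1'), ('block 2', 'b3')]
print(unrolled_matches)  # [('block 1', 'b1'), ('block 3', 'b3')]
```

Laziness only makes the match as short as possible from a given starting point. It does not make the engine give up on block 2 when that block has no attrib b=, which is why b3 ends up paired with block 2.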
