Reputation: 575
I'm using a regex to parse structured text like the sample below, with caret symbols marking what I'm trying to match:
block 1
^^^^^^^
subblock 1.1
attrib a=a1
subblock 1.2
attrib b=b1
^^
block 2
subblock 2.1
attrib a=a2
block 3
^^^^^^^
subblock 3.1
attrib a=a3
subblock 3.2
attrib b=b3
^^
A subblock may or may not appear inside a block; for example, block 2 has no subblock 2.2.
The expected match is [(block1, b1), (block3, b3)].
My pattern, in pseudo-form, is:
/(capture block#)[\s\S]*?attrib\sb=(capture b#)/gm
But this ends up matching [(block1, b1), (block2, b3)].
Where is my regex going wrong?
Upvotes: 1
Views: 4278
Reputation: 627609
You can use
(?m)(^block\s*\d+).*(?:\n(?!block\s*\d).*)*\battrib\s*b=(\w+)
See the regex demo
The regex is based on the unroll-the-loop technique. Here is an explanation:
(?m) - multiline modifier to make ^ match the beginning of a line
(^block\s*\d+) - match and capture the word block + optional whitespace(s) + 1+ digits (Group 1)
.* - matches the rest of the line (as no DOTALL option should be on)
(?:\n(?!block\s*\d).*)* - match any following text that is not the word block followed by optional whitespace(s) and a digit (this way, a boundary is set)
\battrib\s*b=(\w+) - match a whole word attrib followed by 0+ whitespaces, a literal b=, and match and capture 1+ alphanumerics or underscores with (\w+) (note: this can be adjusted as per your real data)
import re
p = re.compile(r'(?m)(^block\s*\d+).*(?:\n(?!block\s*\d).*)*\battrib\s*b=(\w+)')
s = "block 1\n subblock 1.1\n attrib a=a1\n subblock 1.2\n attrib b=b1\nblock 2\n subblock 2.1\n attrib a=a2\nblock 3\n subblock 3.1\n attrib a=a3\n subblock 3.2\n attrib b=b3"
print(p.findall(s))  # [('block 1', 'b1'), ('block 3', 'b3')]
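To see why the boundary matters, here is a minimal sketch that runs a lazy pattern next to the guarded one on the same sample. The lazy pattern is only an assumed reconstruction of the pseudo-regex in the question (the names lazy and guarded are made up for illustration), so treat it as an approximation of what was actually used.

```python
import re

# Sample text from the question, with the same one-space indentation as the snippet above.
s = (
    "block 1\n subblock 1.1\n attrib a=a1\n subblock 1.2\n attrib b=b1\n"
    "block 2\n subblock 2.1\n attrib a=a2\n"
    "block 3\n subblock 3.1\n attrib a=a3\n subblock 3.2\n attrib b=b3"
)

# Assumed reconstruction of the question's pattern: [\s\S]*? is lazy, but nothing
# stops it from running past the next "block N" line while looking for "attrib b=".
lazy = re.compile(r'(?m)^(block\s*\d+)[\s\S]*?attrib\sb=(\w+)')

# Pattern from this answer: the negative lookahead refuses to consume a line
# that starts a new block, so each match stays inside a single block.
guarded = re.compile(r'(?m)(^block\s*\d+).*(?:\n(?!block\s*\d).*)*\battrib\s*b=(\w+)')

lazy_result = lazy.findall(s)
guarded_result = guarded.findall(s)

print(lazy_result)     # [('block 1', 'b1'), ('block 2', 'b3')]
print(guarded_result)  # [('block 1', 'b1'), ('block 3', 'b3')]
```

The lazy quantifier only makes the match as short as possible from a given start; it does not forbid crossing into block 3 when block 2 has no attrib b line, which is where the (block 2, b3) pair comes from.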
Upvotes: 2
Reputation: 6776
What about this regex? https://regex101.com/r/yZ4fL9/1
block (\d).*?attrib b=b(\1)
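A quick way to try this pattern in Python, as a sketch under two assumptions that are not stated above: the DOTALL flag is on so that . can cross line breaks, and every b value ends in the single-digit number of its block, as it happens to in the sample. The variable names are made up for illustration.

```python
import re

# Sample text from the question, one-space indentation assumed.
s = (
    "block 1\n subblock 1.1\n attrib a=a1\n subblock 1.2\n attrib b=b1\n"
    "block 2\n subblock 2.1\n attrib a=a2\n"
    "block 3\n subblock 3.1\n attrib a=a3\n subblock 3.2\n attrib b=b3"
)

# Assumption: re.S (DOTALL) so that .*? can span several lines.
# The backreference \1 forces the digit after "b=b" to equal the block number,
# so block 2 cannot pair up with b3.
backref = re.compile(r'block (\d).*?attrib b=b(\1)', re.S)

backref_result = [m.group(1) for m in backref.finditer(s)]
print(backref_result)  # ['1', '3']
```

This depends entirely on the naming convention in the data: if a block's b value does not repeat the block number, or block numbers have more than one digit, the pattern needs adjusting.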
Upvotes: 0