nested text in regular expressions

Question

I am struggling with regular expressions. I'm having trouble getting my head around similar text nested within larger text. Perhaps you can help me unclutter my thinking.

Here is an example test string:

message msgName { stuff { innerStuff } more stuff } message mn2 { junk }

I want to pull out each term (e.g., msgName, mn2) and what follows it until the next message, to get a list like this:

msgName 
{ stuff { innerStuff } more stuff } 
mn2 
{ junk }

My patterns match either too greedily or not greedily enough: I need to retain the inner brackets but still split apart the top-level messages.

Here is one program:

import re

# Two messages; the first one contains a nested { ... } block.
text = 'message msgName { stuff { innerStuff } more stuff } \n message mn2 { junk }'
# Greedy (.*) runs on to the LAST closing brace in the text.
messagePattern = re.compile(r'message (.*?) {(.*)}', re.DOTALL)
messageList = messagePattern.findall(text)
print("messages:\n")
count = 0
for message, msgDef in messageList:
    count = count + 1
    print(str(count))
    print(message)
    print(msgDef)

It produces:

messages:

1
msgName
 stuff { innerStuff } more stuff } 
 message mn2 { junk

Here is my next attempt, which makes the inner part non-greedy:

import re

# Same input as before.
text = 'message msgName { stuff { innerStuff } more stuff } \n message mn2 { junk }'
# Non-greedy (.*?) stops at the FIRST closing brace it reaches.
messagePattern = re.compile(r'message (.*?) {(.*?)}', re.DOTALL)
messageList = messagePattern.findall(text)
print("messages:\n")
count = 0
for message, msgDef in messageList:
    count = count + 1
    print(str(count))
    print(message)
    print(msgDef)

It produces:

messages:

1
msgName
 stuff { innerStuff 
2
mn2
 junk

So, I lose } more stuff }

I've really run into a mental block on this. Could someone point me in the right direction? I'm failing to deal with text in nested brackets. A working regular expression, or a simpler example of dealing with nested, similar text, would be helpful.

Wiktor Stribiżew · Accepted Answer

If you can use the PyPI regex module, you can leverage its subroutine call support:

>>> import regex
>>> reg = regex.compile(r"(\w+)\s*({(?>[^{}]++|(?2))*})")
>>> s = "message msgName { stuff { innerStuff } } \n message mn2 { junk }"
>>> print(reg.findall(s))
[('msgName', '{ stuff { innerStuff } }'), ('mn2', '{ junk }')]

The regex - (\w+)\s*({(?>[^{}]++|(?2))*}) - matches:

(\w+) - Group 1 matching 1 or more alphanumeric / underscore characters
\s* - 0+ whitespace(s)
({(?>[^{}]++|(?2))*}) - Group 2 matching a {, then 0 or more repetitions of either a run of non-brace characters or another balanced {...} block (the (?2) subroutine call recurses into the whole Group 2 subpattern), and then a closing }.
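
For readers who cannot install regex, the balancing that (?2) performs can be imitated by hand. The following is a minimal sketch of a different technique (a brace-depth counter, not the subroutine-call pattern above). The function name split_messages is hypothetical, and the sketch assumes braces never appear inside quoted strings or comments:

```python
import re

def split_messages(text):
    """Return (name, body) pairs where body is a balanced {...} block."""
    header = re.compile(r"(\w+)\s*\{")  # a word followed by an opening brace
    results = []
    pos = 0
    while True:
        m = header.search(text, pos)
        if m is None:
            break
        start = m.end() - 1  # index of the opening brace
        depth = 0
        for i in range(start, len(text)):
            if text[i] == "{":
                depth += 1
            elif text[i] == "}":
                depth -= 1
                if depth == 0:
                    # Found the brace that closes the outermost block.
                    results.append((m.group(1), text[start:i + 1]))
                    pos = i + 1
                    break
        else:
            break  # braces never balanced: stop scanning
    return results

text = "message msgName { stuff { innerStuff } more stuff } \n message mn2 { junk }"
pairs = split_messages(text)
print(pairs)
```

Unlike the one-line pattern, this handles any nesting depth with the standard library only, at the cost of a little more code.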

If there is at most one level of nesting inside the outer braces, the standard re module can be used, too, with

(\w+)\s*{[^{}]*(?:{[^{}]*}[^{}]*)*}


(\w+) - Group 1 matching word characters
\s* - 0+ whitespaces
{ - opening brace
[^{}]* - 0+ characters other than { and }
(?:{[^{}]*}[^{}]*)* - 0+ sequences of:
- { - opening brace
- [^{}]* - 0+ characters other than { and }
- } - closing brace
- [^{}]* - 0+ characters other than { and }
} - closing brace
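
Note that the pattern as written captures only the name in Group 1. Here is a minimal sketch of applying it to the question's text with the standard re module, assuming the body is wanted as well: the extra pair of parentheses around the braces part and the escaped braces are additions for illustration, not part of the pattern above.

```python
import re

# Group 1: the name. Group 2 (added here): the braces block,
# allowing one level of nested braces inside it.
pattern = re.compile(r"(\w+)\s*(\{[^{}]*(?:\{[^{}]*\}[^{}]*)*\})")

text = "message msgName { stuff { innerStuff } more stuff } \n message mn2 { junk }"
matches = pattern.findall(text)

for count, (name, body) in enumerate(matches, start=1):
    print(count)
    print(name)
    print(body)
```

With two capturing groups, findall returns a list of (name, body) tuples, so the loop from the question carries over almost unchanged.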
