XYZ

Reputation: 191

nested text in regular expressions

I am struggling with regular expressions. I'm having trouble getting my head around matching text that is nested inside larger, similar-looking text. Perhaps you can help me unclutter my thinking.

Here is an example test string:

message msgName { stuff { innerStuff } more stuff } \n message mn2 { junk }

I want to pull out each message name (e.g., msgName, mn2) and everything that follows it up to the next message, to get a list like this:

msgName 
{ stuff { innerStuff } more stuff } 
mn2 
{ junk }

My patterns match either too greedily or not greedily enough: I want to keep the inner brackets intact but split the top-level messages apart.

Here is one program:

import re
text = 'message msgName { stuff { innerStuff } more stuff } \n message mn2 { junk }'
# Greedy (.*) runs on to the last closing brace in the whole text
messagePattern = re.compile(r'message (.*?) {(.*)}', re.DOTALL)
messageList = messagePattern.findall(text)
print("messages:\n")
for count, (message, msgDef) in enumerate(messageList, 1):
    print(count)
    print(message)
    print(msgDef)

It produces:

messages:

1
msgName
 stuff { innerStuff } more stuff } 
 message mn2 { junk 

Here is my next attempt, which makes the inner part non-greedy:

import re
text = 'message msgName { stuff { innerStuff } more stuff } \n message mn2 { junk }'
# Non-greedy (.*?) stops at the first closing brace after the opening one
messagePattern = re.compile(r'message (.*?) {(.*?)}', re.DOTALL)
messageList = messagePattern.findall(text)
print("messages:\n")
for count, (message, msgDef) in enumerate(messageList, 1):
    print(count)
    print(message)
    print(msgDef)

It produces:

messages:

1
msgName
 stuff { innerStuff 
2
mn2
 junk 

So the match stops at the first closing brace, and I lose "} more stuff }".

I've really run into a mental block on this. Could someone point me in the right direction? I'm failing to deal with text in nested brackets. A working regular expression, or a simpler example of dealing with nested, similar text, would be helpful.

Upvotes: 4

Views: 98

Answers (1)

Wiktor Stribiżew

Reputation: 627469

If you can use the PyPI regex module, you can leverage its subroutine call support:

>>> import regex
>>> reg = regex.compile(r"(\w+)\s*({(?>[^{}]++|(?2))*})")
>>> s = "message msgName { stuff { innerStuff } } \n message mn2 { junk }"
>>> print(reg.findall(s))
[('msgName', '{ stuff { innerStuff } }'), ('mn2', '{ junk }')]

The regex - (\w+)\s*({(?>[^{}]++|(?2))*}) - matches:

  • (\w+) - Group 1 matching 1 or more alphanumeric / underscore characters
  • \s* - 0+ whitespace(s)
  • ({(?>[^{}]++|(?2))*}) - Group 2 matching a {, then zero or more repetitions of either a run of characters other than { and } or another balanced {...} block (the (?2) subroutine call recurses into the whole Group 2 subpattern), and then a closing }.
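
For readers who cannot install the regex module, the balanced matching that (?2) performs can be approximated with a brace depth counter. The sketch below is not the recursive-regex approach itself but a swapped-in technique using only the standard library; the names HEADER and split_messages are hypothetical, and it assumes braces never appear inside quotes or comments.

```python
import re

# A name followed by optional whitespace and an opening brace.
HEADER = re.compile(r"(\w+)\s*{")

def split_messages(text):
    """Return (name, body) pairs where body is the balanced {...} block."""
    results = []
    pos = 0
    while True:
        m = HEADER.search(text, pos)
        if m is None:
            break
        start = m.end() - 1  # index of the opening brace
        depth = 0
        for i in range(start, len(text)):
            if text[i] == "{":
                depth += 1
            elif text[i] == "}":
                depth -= 1
                if depth == 0:
                    # Found the brace that closes the outermost block.
                    results.append((m.group(1), text[start:i + 1]))
                    pos = i + 1  # resume searching after this block
                    break
        else:
            break  # unbalanced braces: stop rather than guess
    return results

s = "message msgName { stuff { innerStuff } } \n message mn2 { junk }"
print(split_messages(s))
# [('msgName', '{ stuff { innerStuff } }'), ('mn2', '{ junk }')]
```

Because the search resumes after each closed block, inner names such as stuff are never reported as separate messages, and the depth counter handles any nesting depth.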

If there is at most one level of nesting inside the outer braces, the standard re module can be used, too, with

(\w+)\s*{[^{}]*(?:{[^{}]*}[^{}]*)*}

Pattern details:

  • (\w+) - Group 1 matching word characters
  • \s* - 0+ whitespaces
  • { - opening brace
  • [^{}]* - 0+ characters other than { and }
  • (?:{[^{}]*}[^{}]*)* - 0+ sequences of:
    • { - opening brace
    • [^{}]* - 0+ characters other than { and }
    • } - closing brace
    • [^{}]* - 0+ characters other than { and }
  • } - closing brace
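
As a rough usage sketch, the one-level pattern can be run with the standard re module against the string from the question. The second capture group around the braced body is an addition made here for illustration (the pattern above captures only the name), so that findall returns (name, body) pairs.

```python
import re

# Same one-level pattern, with an extra capture group (an assumption added
# for this example) wrapped around the braced body.
pattern = re.compile(r"(\w+)\s*({[^{}]*(?:{[^{}]*}[^{}]*)*})")

text = "message msgName { stuff { innerStuff } more stuff } \n message mn2 { junk }"
pairs = pattern.findall(text)
for name, body in pairs:
    print(name)
    print(body)
# msgName
# { stuff { innerStuff } more stuff }
# mn2
# { junk }
```

Note that this only covers a single nested level: a body such as { a { b { c } } } is not matched as a whole, because the inner alternative does not allow further braces.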

Upvotes: 1
