Reputation: 191
I am struggling with regular expressions. I'm having trouble wrapping my head around similar text nested within larger text. Perhaps you can help me unclutter my thinking.
Here is an example test string:
message msgName { stuff { innerStuff } } \n message mn2 { junk }
I want to pull out the term (e.g., msgName, mn2) and what follows until the next message, to get a list like this:
msgName { stuff { innerStuff } more stuff }
mn2 { junk }
I am having trouble because my matching is either too greedy or not greedy enough: I want to retain the inner brackets but split apart the top-level messages.
Here is one program:
import re

text = 'message msgName { stuff { innerStuff } more stuff } \n message mn2 { junk }'
# Lazy name, greedy body: the body runs up to the LAST closing brace in the text.
messagePattern = re.compile(r'message (.*?) {(.*)}', re.DOTALL)
messageList = messagePattern.findall(text)
print("messages:\n")
count = 0
for message, msgDef in messageList:
    count = count + 1
    print(str(count))
    print(message)
    print(msgDef)
It produces:
messages:

1
msgName
 stuff { innerStuff } more stuff }
 message mn2 { junk
Here is my next attempt, which makes the inner part non-greedy:
import re

text = 'message msgName { stuff { innerStuff } more stuff } \n message mn2 { junk }'
# Lazy name, lazy body: the body stops at the FIRST closing brace it meets.
messagePattern = re.compile(r'message (.*?) {(.*?)}', re.DOTALL)
messageList = messagePattern.findall(text)
print("messages:\n")
count = 0
for message, msgDef in messageList:
    count = count + 1
    print(str(count))
    print(message)
    print(msgDef)
It produces:
messages:

1
msgName
 stuff { innerStuff
2
mn2
 junk
So I lose the trailing } more stuff } part.
I've really run into a mental block on this. Could someone point me in the right direction? I'm failing to deal with text in nested brackets. A suggestion of a working regular expression, or a simpler example of dealing with nested, similar text, would be helpful.
Upvotes: 4
Views: 98
Reputation: 627469
If you can use the PyPI regex module, you can leverage its subroutine call support:
>>> import regex
>>> reg = regex.compile(r"(\w+)\s*({(?>[^{}]++|(?2))*})")
>>> s = "message msgName { stuff { innerStuff } } \n message mn2 { junk }"
>>> print(reg.findall(s))
[('msgName', '{ stuff { innerStuff } }'), ('mn2', '{ junk }')]
The regex (\w+)\s*({(?>[^{}]++|(?2))*}) matches:
(\w+) - Group 1 matching 1 or more alphanumeric / underscore characters
\s* - 0+ whitespace(s)
({(?>[^{}]++|(?2))*}) - Group 2 matching a {, followed with non-{} or another balanced {...} due to the (?2) subroutine call (recurses the whole Group 2 subpattern), 0 or more times, and then matches a closing }
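For readers who cannot install the regex module, the recursion that (?2) performs can be mirrored by hand. The following is only a minimal sketch using the standard library; the names NAME_BEFORE_BRACE, skip_balanced and find_blocks are hypothetical and not part of any library, and it swaps the recursive regex for a small recursive function.

```python
import re

# Hypothetical helper names; only the standard library is used.
# Matches a word followed by optional whitespace, stopping right before "{".
NAME_BEFORE_BRACE = re.compile(r"(\w+)\s*(?=\{)")


def skip_balanced(s, i):
    """Return the index just past the balanced {...} that opens at s[i], or -1."""
    if i >= len(s) or s[i] != "{":
        return -1
    i += 1
    while i < len(s):
        if s[i] == "}":
            return i + 1              # closing brace of this level
        if s[i] == "{":
            i = skip_balanced(s, i)   # nested block: recurse, as (?2) does
            if i == -1:
                return -1
        else:
            i += 1                    # ordinary character, as [^{}] does
    return -1                         # input ended before the closing brace


def find_blocks(s):
    """Collect (name, braced_body) pairs for every balanced top-level block."""
    blocks, pos = [], 0
    while True:
        m = NAME_BEFORE_BRACE.search(s, pos)
        if m is None:
            return blocks
        end = skip_balanced(s, m.end())
        if end == -1:
            pos = m.end() + 1         # unbalanced brace: skip past it
        else:
            blocks.append((m.group(1), s[m.end():end]))
            pos = end


s = "message msgName { stuff { innerStuff } } \n message mn2 { junk }"
print(find_blocks(s))
```

For the sample string this should give the same name/body pairs as the regex.findall call shown earlier. Unlike the regex, it relies on Python's call stack for nesting depth, which is fine for hand-written input of this kind.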
If there is only one nesting level, re can be used, too, with
(\w+)\s*{[^{}]*(?:{[^{}]*}[^{}]*)*}
See this regex demo
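As a rough sketch of how the single-level pattern might be used with the standard re module on the string from the question: the extra capturing group around the braces and the escaped braces are additions made here (so that findall also returns the body), and the name ONE_LEVEL is hypothetical.

```python
import re

# The single-level pattern, with an added capturing group around the braces
# (an assumption made here so that findall also returns the body).
# Braces outside character classes are escaped for readability only.
ONE_LEVEL = re.compile(r"(\w+)\s*(\{[^{}]*(?:\{[^{}]*\}[^{}]*)*\})")

text = "message msgName { stuff { innerStuff } more stuff } \n message mn2 { junk }"
pairs = ONE_LEVEL.findall(text)
for name, body in pairs:
    print(name, body)
```

This only handles one level of inner braces; with deeper nesting the outer block would no longer match as a whole.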
(\w+) - Group 1 matching word characters
\s* - 0+ whitespaces
{ - opening brace
[^{}]* - 0+ characters other than { and }
(?:{[^{}]*}[^{}]*)* - 0+ sequences of:
    { - opening brace
    [^{}]* - 0+ characters other than { and }
    } - closing brace
    [^{}]* - 0+ characters other than { and }
} - closing brace
Upvotes: 1