Reputation: 105
I'm supposed to capture everything inside a tag and the lines after it, but it's supposed to stop the next time it meets a bracket. What am I doing wrong?
import re #regex
regex = re.compile(r"""
^ # Must start in a newline first
\[\b(.*)\b\] # Get what's enclosed in brackets
\n # only capture bracket if a newline is next
(\b(?:.|\s)*(?!\[)) # should read: anyword that doesn't precede a bracket
""", re.MULTILINE | re.VERBOSE)
haystack = """
[tab1]
this is captured
but this is suppose to be captured too!
@[this should be taken though as this is in the content]
[tab2]
help me
write a better RE
"""
m = regex.findall(haystack)
print(m)
What I'm trying to get is:
[('tab1', 'this is captured\nbut this is suppose to be captured too!\n@[this should be taken though as this is in the content]\n', '[tab2]','help me\nwrite a better RE\n')]
Edit:
regex = re.compile(r"""
^ # Must start in a newline first
\[(.*?)\] # Get what's enclosed in brackets
\n # only capture bracket if a newline is next
([^\[]*) # stop reading at opening bracket
""", re.MULTILINE | re.VERBOSE)
This seems to work, but it also cuts the content off at the first bracket inside it.
Upvotes: 2
Views: 512
Reputation: 3459
Python regex doesn't support recursion afaik.
EDIT: but in your case this would work:
regex = re.compile(r"""
^ # Must start in a newline first
\[(.*?)\] # Get what's enclosed in brackets
\n # only capture bracket if a newline is next
([^\[]*) # stop reading at opening bracket
""", re.MULTILINE | re.VERBOSE)
EDIT 2: yes, it doesn't work properly.
import re
regex = re.compile(r"""
(?:^|\n)\[ # tag's opening bracket
([^\]\n]*) # 1. text between brackets
\]\n # tag's closing bracket
(.*?) # 2. text between the tags
(?=\n\[[^\]\n]*\]\n|$) # until tag or end of string but don't consume it
""", re.DOTALL | re.VERBOSE)
haystack = """[tag1]
this is captured [not a tag[
but this is suppose to be captured too!
[another non-tag
[tag2]
help me
write a better RE[[[]
"""
print(regex.findall(haystack))
I do agree with viraptor though. Regexes are cool, but you can't check your file for errors with them. A hybrid perhaps? :P
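tag_re = re.compile(r'^\[([^\]\n]*)\]$', re.MULTILINE)
tags = list(tag_re.finditer(haystack))
result = {}
# Content of each tag runs up to the line where the next tag starts.
for (mo1, mo2) in zip(tags[:-1], tags[1:]):
    result[mo1.group(1)] = haystack[mo1.end(1)+1:mo2.start(1)-1].strip()
# The last tag's content runs to the end of the string.
if tags:
    last = tags[-1]
    result[last.group(1)] = haystack[last.end(1)+1:].strip()
print(result)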
EDIT 3: That's because the ^ character means negation only inside a [^character class]. Everywhere else it means start of string (or start of line with re.MULTILINE). There's no good way to negate a whole string in a regex, only single characters; the closest thing is a negative lookahead such as (?!...).
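The two meanings of ^ can be seen side by side in a small sketch. The sample string `text` and the variable names are made up for illustration; only standard `re` calls are used.

```python
import re

text = "abc\n[tag]\nx[y"

# Inside a character class, ^ negates: a run of characters that are not '['.
# This is why ([^\[]*) stops at the first bracket, tag or not.
run = re.match(r"[^\[]*", text).group()

# Outside a class, ^ is an anchor. Without re.MULTILINE it matches only at
# the start of the string; with it, at the start of every line.
anchored_plain = re.findall(r"^\[", text)
anchored_multi = re.findall(r"^\[", text, re.MULTILINE)

# A negative lookahead rejects a whole pattern rather than one character:
# keep every line that is not a complete [tag] line.
non_tag_lines = re.findall(r"^(?!\[[^\]\n]*\]$).*$", text, re.MULTILINE)

print(repr(run))        # 'abc\n'
print(anchored_plain)   # []
print(anchored_multi)   # ['[']
print(non_tag_lines)    # ['abc', 'x[y']
```

Note that "x[y" survives the lookahead version even though it contains a bracket, which the character-class version cannot do.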
Upvotes: 3
Reputation: 34145
First of all, why use a regex if you're trying to parse? As you can see, you cannot find the source of the problem yourself, because a regex gives no feedback. Also, you don't have any recursion in that RE.
Make your life simple:
def ini_parse(src):
    in_block = None  # name of the section being read, if any
    contents = {}
    for line in src.split("\n"):
        if line.startswith('[') and line.endswith(']'):
            in_block = line[1:-1]
            contents[in_block] = ""
        elif in_block is not None:
            contents[in_block] += line + "\n"
        elif line.strip() != "":
            raise Exception("content out of block")
    return contents
You get error handling with exceptions and the ability to debug execution as a bonus. Also you get a dictionary as a result and can handle duplicate sections while processing. My result:
{'tab2': 'help me\nwrite a better RE\n\n',
'tab1': 'this is captured\nbut this is suppose to be captured too!\n@[this should be taken though as this is in the content]\n\n'}
RE is much overused these days...
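As for handling duplicate sections while processing, one possible take is sketched below. The name `parse_sections`, the choice to raise `ValueError` on a repeated section, and the line numbers in the messages are assumptions for illustration, not part of the function above.

```python
def parse_sections(src):
    """Line-based section parser; rejecting duplicates is an assumed policy."""
    sections = {}
    current = None  # name of the section being read, if any
    for lineno, line in enumerate(src.splitlines(), start=1):
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]
            if current in sections:
                raise ValueError("line %d: duplicate section %r" % (lineno, current))
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
        elif line.strip():
            raise ValueError("line %d: content outside any section" % lineno)
    # Join each section's lines back into one string.
    return {name: "\n".join(lines) for name, lines in sections.items()}


sample = "[a]\nx\n@[not a tag]\n[b]\ny\n"
parsed = parse_sections(sample)
print(parsed)  # {'a': 'x\n@[not a tag]', 'b': 'y'}
```

Merging repeated sections instead of rejecting them would only need the `raise` replaced with reuse of the existing list.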
Upvotes: 3
Reputation: 143094
Does this do what you want?
regex = re.compile(r"""
^ # Must start in a newline first
\[\b(.*)\b\] # Get what's enclosed in brackets
\n # only capture bracket if a newline is next
([^[]*)
""", re.MULTILINE | re.VERBOSE)
This gives a list of tuples (one 2-tuple per match). If you want a flattened tuple you can write:
m = sum(regex.findall(haystack), ())
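To make the flattening step concrete, here is a small sketch with hand-written pairs standing in for the findall result (the `pairs` data is made up). It also shows `itertools.chain.from_iterable` as an alternative, since `sum` builds a new tuple at every step and gets slow on long lists.

```python
from itertools import chain

# Stand-in for regex.findall(haystack): one (tag, content) tuple per match.
pairs = [("tab1", "first\n"), ("tab2", "second\n")]

# sum() with an empty tuple as the start value concatenates the tuples.
flat_sum = sum(pairs, ())

# Same result in a single pass over the data.
flat_chain = tuple(chain.from_iterable(pairs))

print(flat_sum)    # ('tab1', 'first\n', 'tab2', 'second\n')
print(flat_chain)  # ('tab1', 'first\n', 'tab2', 'second\n')
```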
Upvotes: 2