Reputation: 715
I am writing a python function to process multi-line SQL statements.
e.g.
multi_stmt = """
-- delete empty responses
DELETE FROM idlongDVR_responses WHERE new_response_code = '';
DELETE FROM idwideDVR_responses WHERE new_response_code = '';
-- create a current responses table for idlongDVR
DROP TABLE IF EXISTS idlongDVR_respCurr;
CREATE TABLE idlongDVR_respCurr
SELECT *, MAX(modifiedat) AS latest FROM idlongDVR_responses
GROUP BY sitecode, id, dass, tass, field, value, validation_message
ORDER BY sitecode, id, dass, tass; """
So I have written a regular expression to identify a newline if it is not followed by a double hyphen (start comment), and ends in a semi-colon
sql_line = re.compile(r"""
\n+ # starting from a new line sequence
(?!(--|\n)) # if not followed by a comment start "--" or newline
(.*?) # <<<<< WHY ARE THESE CAPTURING BRACKETS NEEDED?
; # ending with a semicolon
""", re.DOTALL|re.VERBOSE|re.MULTILINE)
stmts = sql_line.findall(multi_statement)
for stmt in stmts:
stmt = stmt[1]
if len(stmt) > 0:
cursor.execute(stmt)
It works OK but only if I enclose the .*?
term in brackets so it becomes (.*?)
. If I don't then I don't match anything.
Why is this? Thanks in advance.
Upvotes: 3
Views: 143
Reputation: 75252
You're using findall
, aren't you? findall
is weird. If you have any capturing groups in the regex, it returns only the contents of the capturing groups. As @KennyTM pointed out, you have a capturing group inside your lookahead, and since it's a negative lookahead, your overall match succeeds only when the group fails to capture anything. Thus, your list of empty strings.
Don't bother asking me why those empty groups don't show up when you use the brackets around the .*?
. From reading the docs, I would expect it to return a list of tuples consisting of two groups: one empty, and one the match you were expecting. But I only get the non-empty groups. That seems to be another layer of nonsensicalness to findall
that I haven't encountered before.
By the way, you don't need the MULTILINE
flag. All that does is change the behavior of the anchors, ^
and $
, allowing them to match at line boundaries as well as at the beginning and end of the whole string. Maybe you knew that already, but there's a very persistent rogue meme floating around that goes "If it's multiline, you have to use MULTILINE
", and I try to stamp on it whenever I see it.
Upvotes: 0
Reputation: 523624
"These capturing brackets are needed" because you used a capturing bracket inside the negative lookahead.
(?!(--|\n))
^ ^
Since this should never be matched, the first capturing group will always be empty in a successful match. Since some methods like .findall
will only return capturing groups (if they exist), you'll only see a list of empty strings.
Removing the (
...)
here should make the regex behave as you expect. BTW you could use [^;]*
instead of .*?
.
sql_line = re.compile(r"\n+(?!--|\n)[^;]*;")
Upvotes: 2