Reputation: 14003
I'm trying to detect valid Java annotations in a text. Here's my test program (I'm currently ignoring all whitespace for simplicity; I'll add that later):
import re
txts = ['@SomeName2', # match
'@SomeName2(', # no match
'@SomeName2)', # no match
'@SomeName2()', # match
'@SomeName2()()', # no match
'@SomeName2(value)', # no match
'@SomeName2(=)', # no match
'@SomeName2("")', # match
'@SomeName2(".")', # no match
'@SomeName2(",")', # match
'@SomeName2(value=)', # no match
'@SomeName2(value=")', # no match
'@SomeName2(=3)', # no match
'@SomeName2(="")', # no match
'@SomeName2(value=3)', # match
'@SomeName2(value=3L)', # match
'@SomeName2(value="")', # match
'@SomeName2(value=true)', # match
'@SomeName2(value=false)', # match
'@SomeName2(value=".")', # no match
'@SomeName2(value=",")', # match
'@SomeName2(x="o_nbr ASC, a")', # match
# multiple params:
'@SomeName2(,value="ord_nbr ASC, name")', # no match
'@SomeName2(value="ord_nbr ASC, name",)', # no match
'@SomeName2(value="ord_nbr ASC, name"insertable=false)', # no match
'@SomeName2(value="ord_nbr ASC, name",insertable=false)', # match
'@SomeName2(value="ord_nbr ASC, name",insertable=false,length=10L)', # match
'@SomeName2 ( "ord_nbr ASC, name", insertable = false, length = 10L )', # match
]
#regex = '((?:@[a-z][a-z0-9_]*))(\((((?:[a-z][a-z0-9_]*))(=)(\d+l?|"(?:[a-z0-9_, ]*)"|true|false))?\))?$'
#regex = '((?:@[a-z][a-z0-9_]*))(\((((?:[a-z][a-z0-9_]*))(=)(\d+l?|"(?:[a-z0-9_, ]*)"|true|false))?(,((?:[a-z][a-z0-9_]*))(=)(\d+l?|"(?:[a-z0-9_, ]*)"|true|false))*\))?$'
regex = r"""
(?:@[a-z]\w*) # @ + identifier (class name)
(
\( # opening parenthesis
(
(?:[a-z]\w*) # identifier (var name)
= # assignment operator
(\d+l?|"(?:[a-z0-9_, ]*)"|true|false) # either a numeric | a quoted string containing only alphanumeric chars, _, space | true | false
)? # optional assignment group
\) # closing parenthesis
)?$ # optional parentheses group (zero or one)
"""
rg = re.compile(regex, re.VERBOSE | re.IGNORECASE)
for txt in txts:
    m = rg.search(txt)
    #m = rg.match(txt)
    if m:
        print "MATCH: ",
        output = ''
        for i in xrange(2):
            output = output + '[' + str(m.group(i+1)) + ']'
        print output
    else:
        print "NO MATCH: " + txt
So basically what I have seems to work for zero or one parameters. Now I'm trying to extend the syntax to zero or more parameters, like in the last example.
I then copied the regex part that represents the assignment and prepended a comma to it for the 2nd to nth group (this group now uses * instead of ?):
regex = '((?:@[a-z][a-z0-9_]*))(\((((?:[a-z][a-z0-9_]*))(=)(\d+l?|"(?:[a-z0-9_, ]*)"|true|false))?(,((?:[a-z][a-z0-9_]*))(=)(\d+l?|"(?:[a-z0-9_, ]*)"|true|false))*\))?$'
That cannot work, however. The problem seems to be how to handle the first element: because it must be optional, strings like the first extension example '@SomeName2(,value="ord_nbr ASC, name")' would be accepted, which is wrong. I have no idea how to make the 2nd to nth assignment depend on the presence of the first (optional) element.
Can it be done? Is it done that way? How do you best solve this?
Thanks
Upvotes: 1
Views: 1430
Reputation: 4842
If you're just trying to detect valid syntax, I believe the regex below will give you the matches you want. But I'm not sure what you are doing with the groups. Do you want each parameter value in its own group as well? That will be harder, and I'm not sure it's even possible with regex.
regex = r'((?:@[a-z][a-z0-9_]*))(?:\((?!,)(?:(([a-z][a-z0-9_]*(=)(?:("[a-z0-9_, ]*")|(true|false)|(\d+l?))))(?!,\)),?)*\)(?!\()|$)'
If you need the individual parameters/values, you probably need to write a real parser for that.
EDIT:
Here's a commented version. I also removed many of the capturing and non-capturing groups to make it easier to understand. If you use this with re.findall() (compiled with re.VERBOSE and re.IGNORECASE), it will return two groups: the function name, and all the params in parentheses:
regex = r'''
(@[a-z][a-z0-9_]*) # function name, captured in group
( # open capture group for all parameters
\( # opening function parenthesis
(?!,) # negative lookahead for unwanted comma
(?: # open non-capturing group for all params
[a-z][a-z0-9_]* # parameter name
= # parameter assignment operator
(?:"[a-z0-9_, ]*"|true|false|(?:\d+l?)) # possible parameter values
(?!,\)) # negative lookahead for unwanted comma and closing parenthesis
,? # optional comma, separating params
)* # close param non-capturing group, make it optional
\) # closing function parenthesis
(?!\(\)) # negative lookahead for empty parentheses
|$ # OR end-of-line (in case there are no params)
) # close capture group for all parameters
'''
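As a minimal usage sketch (written for Python 3), the commented pattern can be compiled and passed to findall() like this. The flags and sample strings are taken from the question; the name ANNOTATION is only illustrative.

```python
import re

# The commented pattern from above, compiled with the flags used in the question.
# ANNOTATION is an illustrative name, not part of the original answer.
ANNOTATION = re.compile(r'''
    (@[a-z][a-z0-9_]*)      # function name, captured in group 1
    (                       # group 2: all parameters, including parentheses
      \(                    # opening parenthesis
      (?!,)                 # no leading comma
      (?:
        [a-z][a-z0-9_]*     # parameter name
        =                   # assignment operator
        (?:"[a-z0-9_, ]*"|true|false|(?:\d+l?))  # parameter value
        (?!,\))             # no trailing comma before the closing parenthesis
        ,?                  # optional separating comma
      )*
      \)                    # closing parenthesis
      (?!\(\))              # no second pair of empty parentheses
      |$                    # OR end of string (no parameters at all)
    )
''', re.VERBOSE | re.IGNORECASE)

for text in ['@SomeName2', '@SomeName2(value=3)', '@SomeName2(value=)']:
    # findall() returns a list of (name, params) tuples, one per match
    print(text, '->', ANNOTATION.findall(text))
```

The first two samples each produce a single (name, params) tuple, with an empty params string when there are no parentheses; the invalid third sample produces an empty list.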
After reading your comment about the parameters, the easiest thing will probably be to use the above regex to pull out all the parameters, then write another regex to pull out name/value pairs to do with as you wish. This will be tricky too, though, because there are commas in the parameter values. I'll leave that as an exercise for the reader :)
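For what it's worth, here is a rough sketch of what that second step could look like (Python 3). It assumes the parameter string has already been validated by the first regex and that values follow the same three forms as above; PAIR and extract_pairs are hypothetical names. Because the quoted-string alternative consumes everything up to the closing quote, commas inside a value do not split a pair.

```python
import re

# Hypothetical second-stage pattern: one name=value pair.
# The value alternatives mirror the ones in the validation regex above.
PAIR = re.compile(r'''
    ([a-z][a-z0-9_]*)                     # parameter name
    =                                     # assignment operator
    ("[a-z0-9_, ]*"|true|false|\d+l?)     # quoted string | boolean | integer
''', re.VERBOSE | re.IGNORECASE)

def extract_pairs(params):
    """Return (name, value) tuples from an already validated parameter string."""
    return PAIR.findall(params)

pairs = extract_pairs('(value="ord_nbr ASC, name",insertable=false,length=10L)')
for name, value in pairs:
    print(name, '=', value)
```

This only pulls the pairs apart; it does no validation of its own, so it should be run on the parameter group captured by the regex above rather than on raw input.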
Upvotes: 2
Reputation: 12486
You've done some funny things here. Here's your original regex:
regex = '((?:@[a-z][a-z0-9_]*))(\((((?:[a-z][a-z0-9_]*))(=)(\d+l?|"(?:[a-z0-9_, ]*)"|true|false))?\))?$'
For starters, use the re.VERBOSE flag so you can break this across multiple lines. This way whitespace and comments in the regular expression do not affect its meaning, so you can document what the regular expression is trying to do.
regex = re.compile("""
((?:@[a-z][a-z0-9_]*)) # Match starting symbol, @-sign followed by a word
(\(
(((?:[a-z][a-z0-9_]*)) # Match arguments??
(=)(\d+l?|"(?:[a-z0-9_, ]*)"|true|false))? # ?????
\))?$
""", re.VERBOSE + re.IGNORECASE)
Since you haven't documented what this regex is trying to do, I can't decompose it any further. Document the intent of any non-trivial regular expression by using re.VERBOSE, breaking it across multiple lines, and commenting it.
Your regex is quite hard to understand because it's trying to do too much. As it stands, your regex is trying to do two things:
1. Match a symbol name of the form @SomeSymbol2, optionally followed by a parenthesised list of arguments, (arg1="val1",arg2="val2"...)
2. Validate the contents of the parenthesised list, so that (arg1="val1",arg2="val2") passes but (232,211) doesn't.
I would suggest breaking this into two parts, as below:
import re
import pprint
txts = [
'@SomeName2', # match
'@SomeName2(', # no match
'@SomeName2)', # no match
'@SomeName2()', # match
'@SomeName2()()', # no match
'@SomeName2(value)', # no match
'@SomeName2(=)', # no match
'@SomeName2("")', # no match
'@SomeName2(value=)', # no match
'@SomeName2(value=")', # no match
'@SomeName2(=3)', # no match
'@SomeName2(="")', # no match
'@SomeName2(value=3)', # match
'@SomeName2(value=3L)', # match
'@SomeName2(value="")', # match
'@SomeName2(value=true)', # match
'@SomeName2(value=false)', # match
'@SomeName2(value=".")', # no match
'@SomeName2(value=",")', # match
'@SomeName2(value="ord_nbr ASC, name")', # match
# extension needed!:
'@SomeName2(,value="ord_nbr ASC, name")', # no match
'@SomeName2(value="ord_nbr ASC, name",)', # no match
'@SomeName2(value="ord_nbr ASC, name",insertable=false)' # no match YET, but should
]
# Regular expression to match overall @symbolname(parenthesised stuff)
regex_1 = re.compile( r"""
^ # Start of string
(@[a-zA-Z]\w*) # Matches initial token. Token name must start with a letter.
# Subsequent characters can be any of those matched by \w, being [a-zA-Z0-9_]
# Note behaviour of \w is LOCALE dependent.
( \( [^)]* \) )? # Optionally, match parenthesised part containing zero or more characters
$ # End of string
""", re.VERBOSE)
#Regular expression to validate contents of parentheses
regex_2 = re.compile( r"""
^
(
([a-zA-Z]\w*) # argument key name (i.e. 'value' in the examples above)
= # literal equals symbol
( # acceptable arguments are:
true | # literal "true"
false | # literal "false"
\d+L? | # integer (optionally followed by an 'L')
"[^"]*" # string (may not contain quote marks!)
)
\s*,?\s* # optional comma and whitespace
)* # Match this entire regex zero or more times
$
""", re.VERBOSE)
for line in txts:
    print("\n")
    print(line)
    m1 = regex_1.search(line)
    if m1:
        annotation_name, annotation_args = m1.groups()
        print "Symbol name : ", annotation_name
        print "Argument list : ", annotation_args
        if annotation_args:
            s2 = annotation_args.strip("()")
            m2 = regex_2.search(s2)
            if (m2):
                pprint.pprint(m2.groups())
                print "MATCH"
            else:
                print "MATCH FAILED: regex_2 didn't match. Contents of parentheses were invalid."
        else:
            print "MATCH"
    else:
        print "MATCH FAILED: regex_1 didn't match."
This nearly gets you to a final solution. The only corner case I can see is that this (incorrectly) matches a trailing comma in the argument list. (You can check for this using a simple string operation, str.endswith().)
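A possible alternative to the str.endswith() check, sketched here for Python 3 and assuming the same value syntax as regex_2: require a comma between parameters instead of allowing an optional one after each. The pattern matches one parameter followed by zero or more ", parameter" repetitions and makes the whole list optional, so leading, trailing and missing commas are all rejected. The names PARAM, STRICT_ARGS and is_valid_args are illustrative only.

```python
import re

# One name=value pair; the value alternatives mirror regex_2 above.
PARAM = r'''
    [a-zA-Z]\w* \s* = \s*                     # argument key name and equals sign
    (?: true | false | \d+L? | "[^"]*" )      # boolean | integer | quoted string
'''

# Hypothetical stricter variant of regex_2: a first parameter, then zero or
# more ", parameter" repetitions. The whole list is optional, so "" is valid.
STRICT_ARGS = re.compile(
    r'^\s*(?:' + PARAM + r'(?:\s*,\s*' + PARAM + r')*)?\s*$',
    re.VERBOSE,
)

def is_valid_args(contents):
    """Check the text between the parentheses (s2 in the loop above)."""
    return STRICT_ARGS.match(contents) is not None

for sample in ['value="ord_nbr ASC, name",insertable=false',
               ',value="x"',
               'value="x",',
               'value="x"insertable=false']:
    print(repr(sample), is_valid_args(sample))
```

Because the comma only appears inside the repeated group, it can never be matched without a parameter on both sides of it, which is also what the question was asking for when it wanted later parameters to depend on the first one.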
Edit, as an afterthought: The syntax for the argument list is actually pretty close to a real data format. You could probably feed the argument list (annotation_args above) to a JSON or YAML parser and it would tell you if it was good or not. Use the existing wheel (a JSON parser) instead of reinventing the wheel, if you can.
This would allow, amongst other things, escaped quote marks inside strings. At the moment the regex chokes on the string "This is a quote mark: \"." because it thinks the second quote ends the string. (It doesn't.) This can be done in regex, but it's horrible and complicated.
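To illustrate the escaped-quote case, here is a small sketch (Python 3) of a commonly used pattern for double-quoted strings with backslash escapes, next to the standard json module doing the same job. QUOTED and is_quoted_string are illustrative names, and treating the value as a JSON string is an assumption: Java and JSON string escapes overlap but are not identical.

```python
import json
import re

# A quote, then any run of (ordinary characters | backslash + any character),
# then the closing quote. Hypothetical replacement for the "[^"]*" alternative.
QUOTED = re.compile(r'"(?:[^"\\]|\\.)*"')

def is_quoted_string(text):
    """True if the whole text is one double-quoted string literal."""
    return QUOTED.fullmatch(text) is not None

literal = r'"This is a quote mark: \"."'
print(is_quoted_string(literal))          # regex-based check
print(json.loads(literal))                # parser-based decoding of the same literal
```

The parser route also decodes the value, whereas the regex only confirms that the quotes balance.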
Upvotes: 1