Reputation: 8303
I'm new to python regular expressions and was wondering if someone could help me out by walking me through what this means (I'll state what I think each bit means here as well).
Thanks!
RegExp:
r'(^.*def\W*)(\w+)\W*\((.*)\):'
r'...' = python definition of regular expression within the ''
(...) = a regex term
(^. = match the beginning of any character
*def\W* = ???
(\w+) = match any of [a, z] 1 or more times
\W*\ = ? i think its the same as the line above this but from 0+ more times instead of 1 but since it matches the def\W line above (which i dont really know the meaning of) i'm not sure.
((.*)\): = match any additional character within brackets ()
thanks!
Upvotes: 10
Views: 5774
Reputation: 13491
So, in the first pair of parenthesis (^.def\W), firstly it matches with the string.
Then .* will match anything any number of times. The following 'def' is an exact match, that only matches with itself.
Then \W* will match zero or more of these non-letter-number-underscore characters. The next pair of parenthesis (\w+) you got it right. In the last part \W*\((.*)\): the initial \W* means the same thing as the previous \W* . Next, \( matches with ( , then there is the group (.*) which means, as previously, anything any number of times, followed by \): that matches ): .
An example of string that is matched by this regular expression is:
thing_def = function_name (123 anything in here):
Upvotes: 0
Reputation: 414865
It seems like a failed attempt to match a Python function signature:
import re
regex = re.compile(r""" # r'' means that \n and the like is two chars
# '\\','n' and not a single newline character
( # begin capturing group #1; you can get it: regex.match(text).group(1)
^ # match begining of the string or a new line if re.MULTILINE is set
.* # match zero or more characters except newline (unless
# re.DOTALL is set)
def # match string 'def'
\W* # match zero or more non-\w chars i.e., [^a-zA-Z0-9_] if no
# re.LOCALE or re.UNICODE
) # end capturing group #1
(\w+) # second capturing group [a-zA-Z0-9_] one or more times if
# no above flags
\W* # see above
\( # match literal paren '('
(.*) # 3rd capturing group NOTE: `*` is greedy `.` matches even ')'
# therefore re.match(r'\((.*)\)', '(a)(b)').group(1) == 'a)(b'
\) # match literal paren ')'
: # match literal ':'
""", re.VERBOSE|re.DEBUG)
re.DEBUG
flag causes the output:
subpattern 1
at at_beginning
max_repeat 0 65535
any None
literal 100
literal 101
literal 102
max_repeat 0 65535
in
category category_not_word
subpattern 2
max_repeat 1 65535
in
category category_word
max_repeat 0 65535
in
category category_not_word
literal 40
subpattern 3
max_repeat 0 65535
any None
literal 41
literal 58
Upvotes: 6
Reputation: 3680
Raw string notation (r"text") keeps regular expressions sane. Without it, every backslash ('\') in a regular expression would have to be prefixed with another one to escape it.
( ) groups a statement, it's treated as ONE thing, so you can do ()? or ()* or ()+ if the things in the brackets need to handled together.
^ matches it if it's the beginning of the string
(Dot.) In the default mode, this matches any character except a newline.
Since it is ".*" the * means match 0 or more matches of the previous thing, which in this case is any character.
def\W* - lines beginning with the string "def" and then \W matches any non-alphanumeric characters and is equivalent to [^a-zA-Z0-9_]. Since we have the * again, this time it matches 0 or more non-alphanumeric characters.
(\w+), the + is for 1 or more of the previous, which in this case is \w, equivalent to [a-zA-Z0-9_].
7.\W*, we know that one already.
"(" - means just to match "(", differentiate it from () grouping things, same for ")/"
(.*) - match 0 or more characters inside.
: - the matched string ends with a colon at the end.
The whole thing seems to be matching a definition of a function in python i.e. "def foo(x):" will be matched. Dealing with regular expressions is hard - using tools such as http://www.pythonregex.com/ helps me to try out different things. And since RE are slightly different in different languages, it's nice to have tools for those too.
Upvotes: 0
Reputation:
r'(^.*def\W\*)(\w+)\W*((.*)):'
==================================
r' tells python this is a raw string so you don't have to double escape all the \
^ match start of string
() in each case this means group this match where each () is a different group
.* match zero or more of any characters
def match the literal 'def'
\W* match zero or more of any non word character
() more grouping of the contained expression
\w+ match one or more of word character
\W* zero or more of any non word character
\( escape the left paren
() more grouping of the contained expression
.* zero of more of any character
\) escape the right paren
: match a single colon literal
This looks like it is trying to match a python method definition. Here is a link to play with this regular expression. Yes it is powered by Ruby, but the syntax is pretty much the same across all languages, I use this site to test regexes for Python, Java and Ruby.
Upvotes: 0
Reputation: 9332
r'..' = Python’s raw string notation for regular expression patterns; backslashes are not handled in any special way in a string literal prefixed with 'r'.
(...) = a capture group which stores the captured value in a var to be used in replacement/mathing.
^ = the start of the string.
.* = 0 or more chars of any type.
def = the literal string def
\W* = 0 or more non word chars (Anything other than a-zA-Z or _)
\w+ = 1 or more word chars (see above)
\( = escapes the (, therefore means a literal (
\) = same as above.
: = literal :
PS: I like the effort that you made to try to understand the regex. It will serve you well, much better than people asking what does r'(^.*def\W*)(\w+)\W*\((.*)\):'
mean.
Upvotes: 3
Reputation: 312680
r'...' = python definition of regular expression within the ''
The r''
syntax has nothing to with regular expressions (or at least, not directly). The r
stands for raw
and is simply an indicator to Python that no string interpolation should be performed on the string.
This is often used with regular expressions so that you don't have to escape backslash (\
) characters, which would otherwise be eaten by the normal string interpolation mechanism.
(^. = match the beginning of any character
I'm not sure what "beginning of any character" means. The ^
character matches the beginning of a line.
def\W = ???
def
matches the characters def
. For \W
, take a look at pydoc re
, which describes the regular expression language.
\W*
As above.
Other than the above, your interpretation seems largely correct.
Upvotes: 0