nobody

Reputation: 8303

Python RegEx Meaning

I'm new to Python regular expressions and was wondering if someone could walk me through what this one means (I'll state what I think each part means as well).

Thanks!

RegExp:
r'(^.*def\W*)(\w+)\W*\((.*)\):'

r'...' = python definition of regular expression within the ''
(...) = a regex term
(^. = match the beginning of any character
*def\W* = ???
(\w+) = match any of [a, z] 1 or more times
\W*\ = ? I think it's the same as the line above this, but 0 or more times instead of 1 or more. Since it resembles the def\W part above (which I don't really understand), I'm not sure.
((.*)\): = match any additional character within brackets ()

thanks!

Upvotes: 10

Views: 5774

Answers (6)

lvella

Reputation: 13491

  • r'...' → the preferred way to define the string of a regular expression in Python
  • (...) → regex term
  • ^ → matches only at the beginning of the string

So, in the first pair of parentheses, (^.*def\W*), the ^ first anchors the match at the beginning of the string.

  • . → matches any character
  • * → repeat the previous match 0 or more times

Then .* will match any character any number of times. The following 'def' is an exact match: it only matches the literal text def.

  • \W → matches anything that is NOT a letter, nor a number, nor the underscore character.

Then \W* will match zero or more of these non-letter-number-underscore characters. The next pair of parentheses, (\w+), you mostly got right: \w matches any letter, digit, or underscore (not just a to z), and + repeats it one or more times. In the last part, \W*\((.*)\):, the initial \W* means the same thing as the previous \W*. Next, \( matches a literal (, then comes the group (.*), which means, as before, anything any number of times, followed by \): which matches the literal text ): .

An example of string that is matched by this regular expression is:

thing_def = function_name (123 anything in here):
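To see what each group captures for that example string, here is a minimal sketch using the pattern from the question unchanged. The variable names (pattern, text, match) are only illustrative.

```python
import re

# The pattern from the question, unchanged.
pattern = re.compile(r'(^.*def\W*)(\w+)\W*\((.*)\):')

text = "thing_def = function_name (123 anything in here):"
match = pattern.match(text)

# Each pair of unescaped parentheses in the pattern is one captured group.
print(repr(match.group(1)))  # text up to and including 'def' and the non-word run
print(repr(match.group(2)))  # the run of word characters
print(repr(match.group(3)))  # whatever sits between the literal parentheses
```

Note that group 1 swallows "thing_" as well, because .* is allowed to match anything that comes before def.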

Upvotes: 0

jfs

Reputation: 414865

It seems like a failed attempt to match a Python function signature:

import re

regex = re.compile(r""" # r'' means that \n and the like is two chars
                        # '\\','n' and not a single newline character

    ( # begin capturing group #1; you can get it: regex.match(text).group(1)
      ^   # match beginning of the string, or of each line if re.MULTILINE is set
      .*  # match zero or more characters except newline (unless
          # re.DOTALL is set)
      def # match string 'def'
      \W* # match zero or more non-\w chars i.e., [^a-zA-Z0-9_] if no
          # re.LOCALE or re.UNICODE
    ) # end capturing group #1

    (\w+) # second capturing group [a-zA-Z0-9_] one or more times if
          # no above flags

    \W*   # see above

    \(    # match literal paren '('
      (.*)  # 3rd capturing group NOTE: `*` is greedy `.` matches even ')'
            # therefore re.match(r'\((.*)\)', '(a)(b)').group(1) == 'a)(b'
    \)    # match literal paren ')'
     :    # match literal ':'
    """, re.VERBOSE|re.DEBUG)

The re.DEBUG flag produces this output:

subpattern 1
  at at_beginning
  max_repeat 0 65535
    any None
  literal 100
  literal 101
  literal 102
  max_repeat 0 65535
    in
      category category_not_word
subpattern 2
  max_repeat 1 65535
    in
      category category_word
max_repeat 0 65535
  in
    category category_not_word
literal 40
subpattern 3
  max_repeat 0 65535
    any None
literal 41
literal 58
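One way to see why this could be called a failed attempt is to feed the pattern a few lines that are not function headers. The sketch below compares it with a stricter pattern; that stricter pattern is a hypothetical alternative written for illustration, not something from the question, and it is still far from a full parser for Python signatures.

```python
import re

# The pattern from the question.
original = re.compile(r'(^.*def\W*)(\w+)\W*\((.*)\):')

# Hypothetical stricter alternative (an assumption for illustration only):
# 'def' must start the line and be followed by at least one whitespace character.
stricter = re.compile(r'^\s*def\s+(\w+)\s*\((.*)\)\s*:')

samples = [
    "def foo(x):",    # a real function header
    "undef foo(x):",  # 'def' is only the tail of another word
    "defoo(x):",      # no separator after 'def' at all
]

for line in samples:
    print(line, bool(original.match(line)), bool(stricter.match(line)))
```

The original accepts all three lines because .* may match any prefix and \W* may match nothing; for "defoo(x):" it even reports "oo" as the name.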


Upvotes: 6

Danail Gabenski

Reputation: 3680

  1. Raw string notation (r"text") keeps regular expressions sane. Without it, every backslash ('\') in a regular expression would have to be prefixed with another one to escape it.

  2. ( ) groups an expression so that it is treated as ONE thing; you can then write ()? or ()* or ()+ if the things in the brackets need to be handled together.

  3. ^ matches only at the beginning of the string.

  4. (Dot.) In the default mode, this matches any character except a newline.

  5. In ".*", the * means 0 or more repetitions of the previous thing, which in this case is any character.

  6. def\W* - the literal string "def" (which, because of the preceding .*, may come after any other text), followed by \W, which matches any non-alphanumeric character and is equivalent to [^a-zA-Z0-9_]. Since we have the * again, this time it matches 0 or more non-alphanumeric characters.

  7. (\w+), the + is for 1 or more of the previous, which in this case is \w, equivalent to [a-zA-Z0-9_].

  8. \W*, we know that one already.

  9. "\(" - matches a literal "(", to differentiate it from the ( ) used for grouping; the same goes for "\)".

  10. (.*) - match 0 or more characters inside the parentheses.

  11. : - the matched text ends with a colon.

The whole thing seems to be matching a function definition in Python, i.e. "def foo(x):" will be matched. Dealing with regular expressions is hard; using tools such as http://www.pythonregex.com/ helps me try out different things. And since regular expressions differ slightly between languages, it's nice to have tools for those too.

Upvotes: 0

user177800

Reputation:

r'(^.*def\W*)(\w+)\W*\((.*)\):'

==================================

r' tells python this is a raw string so you don't have to double escape all the \

^ match start of string

() in each case this means group this match where each () is a different group

.* match zero or more of any characters

def match the literal 'def'

\W* match zero or more of any non word character

() more grouping of the contained expression

\w+ match one or more of word character

\W* zero or more of any non word character

\( escape the left paren

() more grouping of the contained expression

.* zero or more of any character

\) escape the right paren

: match a single colon literal

This looks like it is trying to match a Python method definition. Here is a link to play with this regular expression. Yes, it is powered by Ruby, but the syntax is pretty much the same across these languages; I use this site to test regexes for Python, Java and Ruby.

Upvotes: 0

Jacob Eggers

Reputation: 9332

r'..' = Python’s raw string notation for regular expression patterns; backslashes are not handled in any special way in a string literal prefixed with 'r'.

(...) = a capture group, which stores the captured value so it can be used in replacement/matching.

^ = the start of the string.

.* = 0 or more chars of any type.

def = the literal string def

\W* = 0 or more non-word chars (anything other than a-zA-Z0-9 or _)

\w+ = 1 or more word chars (see above)

\( = escapes the (, therefore means a literal (

\) = same as above.

: = literal :
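As a rough illustration of using captured groups in a replacement, the sketch below renames a function by reusing groups 1 and 3 and substituting new text for group 2. The sample line and the new name bar are made up for the example.

```python
import re

pattern = re.compile(r'(^.*def\W*)(\w+)\W*\((.*)\):')
line = "def foo(x, y):"

# \g<1> and \g<3> reinsert the text captured by groups 1 and 3;
# the name captured by group 2 is replaced with 'bar'.
renamed = pattern.sub(r'\g<1>bar(\g<3>):', line)
print(renamed)
```

The \g<n> form is used instead of \1 so that a group reference followed directly by letters or digits stays unambiguous.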


PS: I like the effort that you made to try to understand the regex. It will serve you well, much better than people asking what does r'(^.*def\W*)(\w+)\W*\((.*)\):' mean.

Upvotes: 3

larsks

Reputation: 312680

r'...' = python definition of regular expression within the ''

The r'' syntax has nothing to do with regular expressions (or at least, not directly). The r stands for raw and is simply an indicator to Python that backslash escape sequences in the string literal should not be processed.

This is often used with regular expressions so that you don't have to escape backslash (\) characters, which would otherwise be consumed by Python's normal escape-sequence handling before the regex engine ever sees them.
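A small sketch of the difference, using \b because it means one thing to Python string literals (backspace) and another to the re module (word boundary). The variable names plain and raw are only illustrative.

```python
import re

# In a normal string literal, '\b' is a single backspace character.
# In a raw string literal, r'\b' is a backslash followed by 'b',
# which the re module then reads as a word-boundary assertion.
plain = '\bdef\b'
raw = r'\bdef\b'

print(len(plain), len(raw))

# The plain pattern looks for backspace characters around 'def' and finds none.
print(re.search(plain, 'def foo(x):'))
# The raw pattern finds 'def' as a whole word.
print(re.search(raw, 'def foo(x):'))
```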

(^. = match the beginning of any character

I'm not sure what "beginning of any character" means. The ^ character matches at the beginning of the string, or at the beginning of each line if the re.MULTILINE flag is set.
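How ^ behaves with and without re.MULTILINE can be checked with a short sketch. The pattern here is a simplified stand-in for the one in the question, and the sample text is made up.

```python
import re

source = "x = 1\ndef foo(a):\ndef bar(b):"

# Simplified pattern (an assumption for illustration): 'def' right at the anchor.
simple = r'^def\W*(\w+)'

# Default: ^ anchors only at the very start of the string, which begins with 'x'.
print(re.findall(simple, source))
# With re.MULTILINE, ^ also matches immediately after each newline.
print(re.findall(simple, source, re.MULTILINE))
```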

*def\W* = ???

def matches the characters def. For \W, take a look at pydoc re, which describes the regular expression language.

\W*

As above.

Other than the above, your interpretation seems largely correct.

Upvotes: 0
