Darren Oakey

Reputation: 3594

python regex weirdness

I thought I was ok with regex - but this has me confused - I have this line in python:

dependencies = re.findall( r"-- *depends *on *([^ ]*.*[^ ]) *$", script, re.MULTILINE)    

which works really well with:

"-- depends on    b    "    -> ["b"]
"-- depends on b"           -> ["b"]
"--dependson  green things    \n-- depends on red things\nother stuff" -> ["green things", "red things"]
"-- depends on b \n-- depends on c" -> ["b", "c"]

but doesn't work on

"-- depends on b\n-- depends on c" -> ["b\n-- depends on c"]

I get that it's going to be some weirdness about the fact that $ matches before the newline - but what I don't get is how to fix the regex?

Upvotes: 1

Views: 62

Answers (2)

Wiktor Stribiżew

Reputation: 626738

In Python's re module, the re.MULTILINE option only changes the behavior of two anchors, ^ and $, so that they match at the start and end of every line rather than only at the start and end of the whole string. From the documentation:

When specified, the pattern character '^' matches at the beginning of the string and at the beginning of each line (immediately following each newline); and the pattern character '$' matches at the end of the string and at the end of each line (immediately preceding each newline). By default, '^' matches only at the beginning of the string, and '$' only at the end of the string and immediately before the newline (if any) at the end of the string. Corresponds to the inline flag (?m).
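To make the quoted behavior concrete, here is a minimal sketch that runs the same `$`-anchored pattern with and without the flag. The sample string and the `\w+$` pattern are made up for illustration and are not from the question.

```python
import re

# Illustrative two-line string (not from the question).
text = "first\nsecond"

# Default: $ matches only at the end of the string
# (or just before a newline that ends the string).
default_matches = re.findall(r"\w+$", text)

# re.MULTILINE: $ also matches immediately before each newline.
multiline_matches = re.findall(r"\w+$", text, re.MULTILINE)

print(default_matches)    # ['second']
print(multiline_matches)  # ['first', 'second']
```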

Next, the [^ ] negated character class matches any char other than a literal regular space char (\x20, dec. code 32). Thus, [^ ]* matches zero or more chars other than a space, and that includes newlines. This is why, when no space precedes the newline, the capture runs on past the end of the first line.
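The failing case from the question can be reproduced directly. This sketch uses the question's own pattern and input; the variable names are only for illustration.

```python
import re

# A negated class excludes only what it lists, so "\n" is still allowed.
newline_matches_class = re.fullmatch(r"[^ ]", "\n") is not None

# Pattern and input taken from the question.
pattern = r"-- *depends *on *([^ ]*.*[^ ]) *$"
script = "-- depends on b\n-- depends on c"

# [^ ]* consumes "b", the newline and "--" before .* takes over,
# so both lines end up in a single capture.
result = re.findall(pattern, script, re.MULTILINE)

print(newline_matches_class)  # True
print(result)                 # ['b\n-- depends on c']
```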

You can use

-- *depends *on *(.*\S) *$

Or, if the input can contain non-breaking spaces or other horizontal Unicode whitespace:

--[^\S\r\n]*depends[^\S\r\n]*on[^\S\r\n]*(.*\S)[^\S\r\n]*$

In Python, you can use

h = r'[^\S\r\n]'
pattern = fr'--{h}*depends{h}*on{h}*(.*\S){h}*$'

The {h}*(.*\S) part does the job: zero or more horizontal whitespace chars are matched and consumed first, then Group 1 captures as many chars other than a newline as possible (.*) followed by one non-whitespace char (\S), so the capture can never end in trailing whitespace or cross a line.
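As a quick check, the pattern above can be run against the inputs from the question. This is a sketch: `find_dependencies` is a hypothetical helper name, and the non-breaking-space input is an added illustration rather than one of the asker's cases.

```python
import re

# Horizontal whitespace: any whitespace except CR and LF.
h = r"[^\S\r\n]"
pattern = re.compile(fr"--{h}*depends{h}*on{h}*(.*\S){h}*$", re.MULTILINE)


def find_dependencies(script):
    """Return the text after each '-- depends on' marker (hypothetical helper)."""
    return pattern.findall(script)


# Inputs from the question, including the one that previously failed.
trailing = find_dependencies("-- depends on    b    ")
two_lines = find_dependencies("-- depends on b\n-- depends on c")
mixed = find_dependencies(
    "--dependson  green things    \n-- depends on red things\nother stuff"
)

# Added illustration: non-breaking spaces (U+00A0) between the words.
nbsp = find_dependencies("--\u00a0depends\u00a0on\u00a0b")

print(trailing)   # ['b']
print(two_lines)  # ['b', 'c']
print(mixed)      # ['green things', 'red things']
print(nbsp)       # ['b']
```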

Upvotes: 1

sbingner

Reputation: 122

The "\n" newline is being matched as "not a space". You can fix it like so for this example:

-- *depends *on *([^ \n]*.*[^ \n]) *$

You probably really wanted something like:

--\s*depends\s*on\s*(\S*.*\S)\s*$

\s means "any whitespace character" (which includes newlines) and \S means any non-whitespace character.
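Both patterns from this answer can be tried on the failing input from the question. The sketch below also includes one made-up input, a marker with nothing after it on its own line, to show a side effect of \s matching newlines; that input and the variable names are assumptions for illustration.

```python
import re

script = "-- depends on b\n-- depends on c"

# First fix: exclude the newline from the negated classes.
narrow = r"-- *depends *on *([^ \n]*.*[^ \n]) *$"
# Second suggestion: use \s and \S throughout.
broad = r"--\s*depends\s*on\s*(\S*.*\S)\s*$"

narrow_result = re.findall(narrow, script, re.MULTILINE)
broad_result = re.findall(broad, script, re.MULTILINE)

print(narrow_result)  # ['b', 'c']
print(broad_result)   # ['b', 'c']

# Made-up edge case: the marker has no dependency on its own line.
# Because \s also matches "\n", the \s* after "on" can step onto the next line.
empty_marker = "-- depends on\nfoo"
narrow_edge = re.findall(narrow, empty_marker, re.MULTILINE)
broad_edge = re.findall(broad, empty_marker, re.MULTILINE)

print(narrow_edge)  # []
print(broad_edge)   # ['foo']
```

If a marker with an empty dependency can occur in the input, the first pattern or a horizontal-whitespace class may be the safer choice.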

Upvotes: 0
