Shadowmak

Reputation: 111

Python regex - weird behavior - findall doesn't match regex101

(By the way, I believe the problem is in the (?s).*? part.)

I need to extract some functions from files.

I have this code:

pattern = "^\s*[a-zA-Z_]?.*void\s+[a-zA-Z_][a-zA-Z_0-9]*\s*\((?s).*?\).*?$"
objekt = re.findall(re.compile(pattern,re.MULTILINE), string)

where string is

extern inline void
lineBreak              (     void     )

;


extern      inline void           debugPrintf
(
const int level,
          const char        *  const                    format,
...)
{
return NULL;
}

extern void
debugPutc
(
const int level
,
const int c)
;

However, it returns:

extern inline void
lineBreak              (     void     )

;


extern      inline void           debugPrintf
(
const int level,
          const char        *  const                    format,
...)
{
return NULL;
}

extern void
debugPutc
(
const int level
,
const int c)

whereas when I debug it on regex101, it returns the 3 functions that I need to extract.

regex101 demo

Does anyone know where the problem is? Thank you.

EDIT:

By the way, before that I had this pattern:

"^\s*[a-zA-Z_]?.*void\s+[a-zA-Z_][a-zA-Z_0-9]*\s*\((?:\s*?.*?\s*?)*\)\s*?$"

and everything was working fine, except that there was catastrophic backtracking and it didn't work for types other than void (like double).

Upvotes: 0

Views: 915

Answers (2)

Alan Moore

Reputation: 75222

You're right, it's the (?s) that's messing you up. In most flavors that support inline modifiers, you can insert (?s) anywhere in the regex, and single-line mode will start at that point and remain in effect until the end of the regex unless you turn it off with (?-s). If it's inside a group, the mode will reset when the group ends. Alternatively, you can use a mode-modified group (a non-capturing group with an embedded mode modifier): (?s:...).

But Python is not nearly so flexible. Before version 3.6 it doesn't support mode-modified groups, and an inline modifier always affects the whole regex, no matter where you place it. As Markus said, the solution is to use [\S\s]*? instead (an idiom often used in JavaScript regexes, which have no singleline/DOTALL mode at all).

I also recommend that you use Python's raw string notation for regexes:

pattern = r"^\s*.*void\s+[a-zA-Z_][a-zA-Z_0-9]*\s*\([\s\S]*?\).*?$"
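To check the claim, here is a minimal sketch that runs the raw-string pattern above against the sample input from the question. The variable names (source, matches) are only illustrative, and trailing whitespace in the sample is assumed to be absent.

```python
import re

# Sample input copied from the question (trailing whitespace omitted).
source = """extern inline void
lineBreak              (     void     )

;


extern      inline void           debugPrintf
(
const int level,
          const char        *  const                    format,
...)
{
return NULL;
}

extern void
debugPutc
(
const int level
,
const int c)
;
"""

# Raw string, with [\s\S]*? in place of (?s).*? so no DOTALL flag is needed.
pattern = re.compile(
    r"^\s*.*void\s+[a-zA-Z_][a-zA-Z_0-9]*\s*\([\s\S]*?\).*?$",
    re.MULTILINE,
)

matches = pattern.findall(source)
for m in matches:
    # The leading ^\s* can swallow preceding blank lines, so strip them.
    print(repr(m.strip()))
```

With this input the loop should print one entry per declaration (lineBreak, debugPrintf, debugPutc), each ending at the closing parenthesis of its parameter list.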

Upvotes: 1

Markus Jarderot

Reputation: 89171

It says in the documentation that

Note that the (?x) flag changes how the expression is parsed. It should be used first in the expression string, or after one or more whitespace characters. If there are non-whitespace characters before the flag, the results are undefined.

Other strange things seem to happen with the other flags: the . at the start of the pattern was affected by the (?s) at the end, on the second and subsequent matches.

Before version 3.6, Python has no way to turn off a flag once it is set, and no way to scope a flag. (In Perl and some other flavors, you can scope a flag with (?s:.*?) and disable it with (?-s).)
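For readers on a newer interpreter: to my knowledge, Python 3.6 and later do accept a scoped group such as (?s:...), and Python 3.11 and later reject a global inline flag like (?s) that is not at the very start of the pattern. The following is a small sketch under that assumption, using a shortened version of the question's input; the names source, scoped and found are made up for the example.

```python
import re

# Shortened sample: one declaration whose parameter list spans several lines.
source = "extern void\ndebugPutc\n(\nconst int level\n,\nconst int c)\n;\n"

# (?s:...) turns DOTALL on only inside the group (Python 3.6+),
# so the leading .*? still stops at the end of its own line.
scoped = re.compile(
    r"^.*?\bvoid\s+[a-zA-Z_][a-zA-Z_0-9]*\s*\((?s:.*?)\).*$",
    re.MULTILINE,
)

# The scoped-flag group is non-capturing, so findall returns whole matches.
found = scoped.findall(source)
print(found)
```

On older versions this pattern will not compile, so the [\S\s]*? idiom remains the portable choice.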

Another way to write the pattern, that would have the effect you seek:

pattern = r"^.*?\bvoid\s+[a-zA-Z_][a-zA-Z_0-9]*\s*\([\S\s]*?\).*$"
  • \b matches a word boundary: the position between a word character (A-Z, a-z, 0-9 and "_") and a non-word character.
  • [\S\s] will match any non-whitespace OR whitespace character. That is, any character, including linebreaks.
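The two points above can be tried in isolation. This is just a small sketch with made-up inputs ("avoid void" and a three-line call), not part of the original pattern:

```python
import re

# \b requires a word/non-word transition, so the "void" inside "avoid"
# is skipped and only the standalone word is found.
word_hits = [m.start() for m in re.finditer(r"\bvoid", "avoid void")]

text = "f(\na,\nb)"

# Without DOTALL, "." cannot cross the line breaks inside the parentheses.
dot_result = re.findall(r"\(.*?\)", text)

# [\S\s] matches any character, line breaks included.
class_result = re.findall(r"\([\S\s]*?\)", text)

print(word_hits, dot_result, class_result)
```

The . version finds nothing because the parenthesized text spans several lines, while the [\S\s] version returns the whole parenthesized block.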

Upvotes: 3
