Reputation: 111
(I believe the problem is in (?s).*? just btw)
I need to extract some functions from files.
I have this code:
pattern = "^\s*[a-zA-Z_]?.*void\s+[a-zA-Z_][a-zA-Z_0-9]*\s*\((?s).*?\).*?$"
objekt = re.findall(re.compile(pattern,re.MULTILINE), string)
where string is
extern inline void
lineBreak ( void )
;
extern inline void debugPrintf
(
const int level,
const char * const format,
...)
{
return NULL;
}
extern void
debugPutc
(
const int level
,
const int c)
;
However, it returns
extern inline void
lineBreak ( void )
;
extern inline void debugPrintf
(
const int level,
const char * const format,
...)
{
return NULL;
}
extern void
debugPutc
(
const int level
,
const int c)
whereas when I debug the pattern at regex101, it returns the 3 functions that I need to extract.
Does anyone know where the problem is, please? Thank you.
EDIT:
Just by the way before that I had this pattern:
"^\s*[a-zA-Z_]?.*void\s+[a-zA-Z_][a-zA-Z_0-9]*\s*\((?:\s*?.*?\s*?)*\)\s*?$"
and everything was working fine, except that there was catastrophic backtracking and it didn't work for return types other than void (like double).
Upvotes: 0
Views: 915
Reputation: 75222
You're right, it's the (?s) that's messing you up. In most flavors that support inline modifiers, you can insert (?s) anywhere in the regex, and single-line mode will start at that point and remain in effect until the end of the regex unless you turn it off with (?-s). If it's inside a group, the mode will reset when the group ends. Alternatively, you can use a mode-modified group (a non-capturing group with an embedded mode modifier): (?s:...).
But Python (at least before version 3.6) is not nearly so flexible. It doesn't support mode-modified groups, and an inline modifier always affects the whole regex, no matter where you place it. As Markus said, the solution is to use [\S\s]*? instead (an idiom often used in JavaScript regexes, which had no singleline/DOTALL mode at all before ES2018).
I also recommend that you use Python's raw string notation for regexes:
pattern = r"^\s*.*void\s+[a-zA-Z_][a-zA-Z_0-9]*\s*\([\s\S]*?\).*?$"
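As a quick check, here is a minimal sketch that runs the pattern above against a copy of the declarations from the question. The names SAMPLE, PATTERN, SCOPED_PATTERN and extract are mine, not the asker's. The second pattern relies on the scoped (?s:...) group, which to my knowledge the re module only accepts from Python 3.6 onward, so treat it as version-dependent.

```python
import re

# Copy of the input from the question (the variable name is hypothetical).
SAMPLE = """extern inline void
lineBreak ( void )
;
extern inline void debugPrintf
(
const int level,
const char * const format,
...)
{
return NULL;
}
extern void
debugPutc
(
const int level
,
const int c)
;
"""

# [\s\S]*? stands in for a DOTALL ".*?" without switching on any flag.
PATTERN = re.compile(
    r"^\s*.*void\s+[a-zA-Z_][a-zA-Z_0-9]*\s*\([\s\S]*?\).*?$",
    re.MULTILINE,
)

# Assumption: Python 3.6+; (?s:...) limits DOTALL to the group only.
SCOPED_PATTERN = re.compile(
    r"^\s*.*void\s+[a-zA-Z_][a-zA-Z_0-9]*\s*\((?s:.*?)\).*?$",
    re.MULTILINE,
)


def extract(pattern, text):
    """Return every declaration header that pattern matches in text."""
    return pattern.findall(text)


for header in extract(PATTERN, SAMPLE):
    print(repr(header))
```

Both patterns should yield one match per declaration, each ending at the closing parenthesis of the parameter list, because the . outside the parentheses never crosses a line break.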
Upvotes: 1
Reputation: 89171
It says in the documentation that:
Note that the (?x) flag changes how the expression is parsed. It should be used first in the expression string, or after one or more whitespace characters. If there are non-whitespace characters before the flag, the results are undefined.
Other strange things seem to happen for the other flags. The . at the start of the pattern was affected by the (?s) at the end, on the second and subsequent matches.
Python (at least before version 3.6) does not have any way to turn off the flags once they are set, and there is no way to scope them. (In Perl and some other flavors, you can use a scoped group such as (?s:.*?) and disable a flag with (?-s).)
Another way to write the pattern, that would have the effect you seek:
pattern = r"^.*?\bvoid\s+[a-zA-Z_][a-zA-Z_0-9]*\s*\([\S\s]*?\).*$"
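To see what the two less obvious pieces of this pattern contribute, here is a small sketch. The test strings and the name HEADER are made up for illustration.

```python
import re

# \b requires a word boundary, so the "void" inside "avoid" is not matched.
assert re.search(r"\bvoid", "int avoid(int x)") is None
assert re.search(r"\bvoid", "static void f(void)") is not None

# Without DOTALL, "." stops at a newline, while [\S\s] does not.
assert re.fullmatch(r".", "\n") is None
assert re.fullmatch(r"[\S\s]", "\n") is not None

# The pattern from this answer, applied to a multi-line declaration.
HEADER = re.compile(
    r"^.*?\bvoid\s+[a-zA-Z_][a-zA-Z_0-9]*\s*\([\S\s]*?\).*$",
    re.MULTILINE,
)
text = "extern void\ndebugPutc\n(\nconst int level\n,\nconst int c)\n;\n"
match = HEADER.search(text)
print(match.group(0) if match else "no match")
```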
\b matches a word boundary, that is, the position between a word character (A-Z, a-z, 0-9 and "_") and a non-word character.
[\S\s] will match any non-whitespace OR whitespace character. That is, any character, including linebreaks.
Upvotes: 3