Obomar

Reputation: 61

Find a word in multiline comment with one regex

I need a regex that matches a specific capturing group which falls inside a multiline comment /* ... */.

In particular I need to find PHP variable definitions inside multiline comments

for example:

/* other code $var = value1 */
$var = value2 ;

/* 
other code
$var = value3 ;
other code
*/

This must match only the two occurrences of '$var =' inside the comments, but not the one outside them.

For the above example I wrote a regex that uses an unrestricted (variable-width) lookbehind, like this:

(?<=[/][\*][^/]+)(\$var) | (?<=[/][\*][^\*]+)(\$var)

But this regex fails whenever both characters * and / appear between the comment opening tag '/*' and $var, even if they are apart from one another, which is not the desired behaviour.

For example, it fails in this case:

$var = .... ;

/* 
other * code /
$var = .... ;
other code
*/

because it finds both '*' and '/', even though together they do not form the comment closing tag.

The key point is that I cannot negate a token made up of two characters; I can only negate them one at a time: [^*] or [^/].

Furthermore, I cannot use the token [\s\S] instead of [^/] and [^*], because it would also select a $var outside the comments whenever an earlier comment block precedes it.

Any ideas? Is it even possible to achieve this with a normal regex, or would I need something different?

Upvotes: 3

Views: 466

Answers (5)

Maneskin

Reputation: 11

Try this in PHP; I have only confirmed that it works in Java.

(?s)(?i)(^|\s+?)(/*)((.)(?!*/))?(this)(.?)(*/)

In this example, the word being searched for is "this".

Upvotes: 0

bobble bubble

Reputation: 18490

One idea is to use \G to glue matches to /*:

(?:/\*|\G(?!^))(?:(?!\*/)[^$])*\K\$var\s*=\s*(?:(?!\*/)[^$;])*

This might be hard to understand if you don't work with regexes a lot. See regex101 for a demo.

\G can be seen as "glue": it continues at the end of the previous match. But \G also matches at the start of the string, which is why the negative lookahead in \G(?!^) is used, so that \G only continues a previous match.

  • /\*|\G(?!^) finds the beginning of a match at /* or continues from the previous match.

  • (?:(?!\*/)[^$])* matches any number of characters that are not $ (negated class), as long as they do not end the comment (?!\*/). This covers the text before and between occurrences of $var.

  • \K\$var: \K resets the beginning of the reported match just before $var. \K can be a useful alternative to a variable-width lookbehind, which is not available in PCRE.

  • \s*=\s*(?:(?!\*/)[^$;])* matches the value of the variable. This is far from perfect and would need modification for quoted values or if it does not suit your input. After the =, it matches [^$;], i.e. characters that are neither a dollar sign nor a semicolon, as long as there is no */ ahead (?!\*/).

This regex does not check whether there is actually a comment end */; it just binds matches to /*.
Another idea would be to use a variant of this trick with the verbs (*SKIP)(*FAIL), like in this demo.
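
To try the \G pattern from PHP, a minimal sketch follows. It assumes the sample input from the question; the names $source, $pattern and $found are only illustrative, and trimming the matches is an addition here because the value part also picks up trailing whitespace.

```php
<?php
// Sample input from the question (single-quoted, so $var is not interpolated).
$source = '/* other code $var = value1 */
$var = value2 ;

/*
other code
$var = value3 ;
other code
*/';

// The \G pattern from this answer, unchanged, wrapped in ~ delimiters.
$pattern = '~(?:/\*|\G(?!^))(?:(?!\*/)[^$])*\K\$var\s*=\s*(?:(?!\*/)[^$;])*~';

preg_match_all($pattern, $source, $m);

// The value part also consumes whitespace before ; or */, so trim each match.
$found = array_map('trim', $m[0]);

print_r($found);
```

On this input the two assignments inside the comments should be reported, and $var = value2 between the comments should be skipped, because neither /* nor the end of a previous match sits directly before it.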

Upvotes: 1

Alan Moore

Reputation: 75222

This matches just $var, and only inside a multiline comment:

(?s)\$var(?=(?:(?!/\*|\*/).)*\*/)


(?:(?!/\*|\*/).)* is a captive lookahead (also known as a Tempered Greedy Token--good name, but too many syllables), and it's how you exclude a sequence, as opposed to a single character. This one matches zero or more of any character (including newline, because of the (?s)), as long as it's not the first character of /* or */.

The enclosing lookahead succeeds if it finds */ without first encountering /*. That means the current position must be inside a comment (there's no need to match the opening /*). And because the lookahead doesn't consume any characters, you can match more than one item per comment, if you need to.
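
A minimal PHP sketch of applying this pattern to the sample from the question. The variable names, and the use of PREG_OFFSET_CAPTURE to work out the line of each hit, are assumptions made for illustration only.

```php
<?php
// Sample input from the question (single-quoted, so $var is not interpolated).
$source = '/* other code $var = value1 */
$var = value2 ;

/*
other code
$var = value3 ;
other code
*/';

// The lookahead pattern from this answer, wrapped in ~ delimiters.
$pattern = '~(?s)\$var(?=(?:(?!/\*|\*/).)*\*/)~';

preg_match_all($pattern, $source, $m, PREG_OFFSET_CAPTURE);

$lines = [];
foreach ($m[0] as [$text, $offset]) {
    // Convert the byte offset of each hit into a 1-based line number.
    $lines[] = substr_count(substr($source, 0, $offset), "\n") + 1;
}

print_r($lines);
```

With this input, the hits should fall on lines 1 and 6, the two lines inside comments; the $var on line 2 is rejected because the lookahead reaches /* before it reaches */.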

One thing that can fool this regex is a */ that's not really a comment closer. So these:

$var = "*/";

$var = ...;
// */

... would match, even though they're not in a comment.
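
The caveat can be reproduced with a short PHP sketch. The two sample strings are the ones shown above; the names $samples and $hits are illustrative.

```php
<?php
// The lookahead pattern from this answer, wrapped in ~ delimiters.
$pattern = '~(?s)\$var(?=(?:(?!/\*|\*/).)*\*/)~';

// Neither sample contains a real multiline comment.
$samples = [
    '$var = "*/";',          // */ inside a string literal
    "\$var = ...;\n// */",   // */ inside a single-line comment
];

$hits = [];
foreach ($samples as $sample) {
    // preg_match returns 1 when the pattern matches, 0 when it does not.
    $hits[] = preg_match($pattern, $sample);
}

print_r($hits);
```

Both entries should come out as 1, i.e. false positives: the pattern only looks for a following */ and has no notion of strings or // comments.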

Upvotes: 2

Toto

Reputation: 91385

How about:

$str = '
/* other code */
$var = "var1";

/* 
other code
$var = "var2";
other code
*/
/* other code */
$var = "var3";

/* 
other code / <-- a slash here
$var = "var4";
other code
*/';

preg_match_all('~/\*(?:(?!\*/).)+?(\$var = .+?;).*?\*/~s', $str, $m);
print_r($m[1]);

Output:

Array
(
    [0] => $var = "var2";
    [1] => $var = "var4";
)

Upvotes: 1

Andreas Louv

Reputation: 47099

Something like this might work:

/\/\*.*?\$var\s*\=\s(.*?)(?=\s*;)/s

Usage:

$str = '$var = .... ;
/*
other code
$var = ..... ;
other code
*/';
preg_match('/\/\*.*?\$var\s*\=\s(.*?)(?=\s*;)/s', $str, $matches);

var_dump($matches);

Will output:

array(2) {
  [0]=>
  string(26) "/*
other code
$var = ....."
  [1]=>
  string(5) "....."
}

The value you are after is stored in $matches[1].


Upvotes: 0
