fantleas

Reputation: 41

Regex: Match a substring in all lines, except when the substring is inside a comment section

Here I go:

I'm coding a PHP application, and I've got a new official domain for it, where all the FAQs are now located. Some of the files in my script include help links to the old FAQ domain, so I want to replace them with the new domain. However, I want to keep the URLs that point to the old domain when they are inside a comment or comment block (I still use the old domain for self-reference and other documentation).

So, basically, what I want to achieve is a regular expression that works given the following:

  1. Match all the occurrences of example.com in all lines (see the exception below).
  2. Don't match the entire line, only the example.com string.
    • Exception: if the line starts with //, /*, or " *", don't match any example.com instance in that single line (although this might be a problem if a comment block is closed in the same line where it was opened).

I usually write my block comments like this:

/* text
 * blah 
 * blah
*/

That's why I don't want to match "example.com" if it's situated after //, /*, or " *".

I figured it would be something like this:

^(?:(?!//|/\*|\s\*).?).*example\.com

But this has one issue: it matches the whole line, instead of "example.com" only (this causes problems mainly when two or more "example.com" strings are matched in a single line).

Can someone please help me fix my regex? Please note: It doesn't have to be a PHP regex, since I could always use a tool like grepWin to locally edit all the files at once.

Oh, and please let me know if there's a way to generalize block comments in some way, like this: once /* is found, do not match example.com until */ is found. That would be extremely useful. Is it possible to achieve it in general (non language-dependent) regular expressions?

Upvotes: 4

Views: 374

Answers (2)

Gumbo

Reputation: 655239

I would use some kind of tokenizer to tell comments and other language tokens apart.

As you’re processing PHP files, you should use PHP’s own tokenizer function token_get_all:

$tokens = token_get_all($source);

Then you can enumerate the tokens and separate the tokens by their type:

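$parts = array();
foreach ($tokens as $token) {
    if (!is_array($token)) {
        // single-character token such as ';' or '{', returned as a plain string
        $parts[] = $token;
    } elseif (in_array($token[0], array(T_COMMENT, T_DOC_COMMENT), true)) {
        // comment: keep the text unchanged
        $parts[] = $token[1];
    } else {
        // not a comment: replace the domain
        $parts[] = str_replace('example.com', 'example.net', $token[1]);
    }
}
$source = implode('', $parts);

At the end, implode puts everything back together. Note that token_get_all returns single-character tokens as plain strings rather than arrays, and that T_ML_COMMENT only existed in PHP 4, so it is not checked here.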

For other languages, where you don’t have a proper tokenizer at hand, you can write your own little tokenizer:

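// The s flag lets block comments span lines; line comments stop at the line break.
preg_match_all('~/\*.*?\*/|//[^\r\n]*|(example\.com)|.~s', $code, $tokens, PREG_SET_ORDER);
foreach ($tokens as &$token) {
    if (!empty($token[1])) {
        // group 1 is only set when example.com matched outside a comment
        $token = str_replace('example.com', 'example.net', $token[1]);
    } else {
        $token = $token[0];
    }
}
unset($token); // break the reference left over from the loop
$code = implode('', $tokens);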

Note that this does not take any other tokens, such as strings, into account. So example.com won’t be replaced if it appears inside a string but within something that merely looks like a ‘comment’, as in:

'foo /* not a comment example.com */ bar'
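One way to soften that limitation is to let the pattern consume quoted strings as tokens too, so that comment markers inside a string (including the // of a URL such as http://example.com) are never read as comments. The following is only a minimal sketch under some assumptions: the function name replaceDomainOutsideComments is made up for illustration, only plain single- and double-quoted literals are recognised (no heredocs), and an apostrophe outside PHP code would still confuse it.

```php
<?php
// Sketch only: replaceDomainOutsideComments is a hypothetical helper name.
function replaceDomainOutsideComments(string $code): string
{
    // Nowdoc keeps both quote characters readable; the x flag allows layout and # comments.
    $pattern = <<<'REGEX'
~
  ( /\* .*? \*/ | // [^\r\n]* )     # group 1: block comment or line comment
| ' (?: \\. | [^'\\] )* '           # single-quoted string literal
| " (?: \\. | [^"\\] )* "           # double-quoted string literal
| example\.com                      # bare occurrence anywhere else
~sx
REGEX;

    return preg_replace_callback($pattern, function (array $m): string {
        if (isset($m[1]) && $m[1] !== '') {
            return $m[0]; // real comment: leave untouched
        }
        // string literal or bare occurrence: swap the domain
        return str_replace('example.com', 'example.net', $m[0]);
    }, $code);
}
```

Because a string literal is matched from its opening quote, anything comment-like inside it is consumed as part of the string, and the domain inside the string is still replaced.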

Upvotes: 2

Tim Pietzcker

Reputation: 336148

A regex that only matches example.com if it's not inside a comment block (it does not handle // line comments, so you'd have to deal with those separately):

$result = preg_replace(
    '%example\.com # Match example.com
    (?!            # only if it\'s not possible to match
     (?:           # the following:
      (?!/\*)      #  (unless an opening comment starts first)
      .            #  any character
     )*            # any number of times
     \*/           # followed by a closing comment.
    )              # End of lookahead
    %sx', 
    'newdomain.com', $subject);
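If the engine is PCRE (as with PHP's preg functions), line comments can probably be folded into the same pass with the backtracking control verbs (*SKIP)(*FAIL): the first alternative consumes a whole comment and then forces a failure, so only occurrences outside comments are left to match. This is a rough sketch and not part of the regex above. The function name replaceSkippingComments is hypothetical, and the (?<!:) check is a heuristic assumption so that the // in http:// is not mistaken for a line comment.

```php
<?php
// Sketch only: replaceSkippingComments is a hypothetical helper name.
function replaceSkippingComments(string $subject): string
{
    // Alternative 1: a block comment, or a line comment whose // is not preceded
    // by ':' (heuristic for URLs). (*SKIP)(*FAIL) discards it and resumes after it.
    // Alternative 2: the domain itself, reached only outside skipped comments.
    $pattern = '~(?:/\*.*?\*/|(?<!:)//[^\r\n]*)(*SKIP)(*FAIL)|example\.com~s';

    return preg_replace($pattern, 'newdomain.com', $subject);
}
```

Like the lookahead version, this does not know about string literals, so a /* inside a quoted string would still be treated as the start of a comment.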

Upvotes: 2
