Reputation: 41
Here I go:
I'm coding a PHP application, and I've got a new official domain for it, where all the FAQs are now located. Some of the files in my script include help links to the old FAQ domain, so I want to replace them with the new domain. However, I want to keep the URLs that link to the old domain if they are located inside a comment or comment block (I still use the old domain for self-reference and other documentation).
So, basically, what I want to achieve is a regular expression that works given the following:
- Match example.com in all lines.
- *.example.com string.
- After //, /*, or " *", don't match any example.com instance in that single line (although this might be a problem if a comment block is closed on the same line where it was opened).

I usually write my block comments like this:
/* text
 * blah
 * blah
 */
That's why I don't want to match "example.com" if it's situated after //, /*, or " *".
I figured it would be something like this:
^(?:(?!//|/\*|\s\*).?).*example\.com
But this has one issue: it matches the whole line, instead of "example.com" only (this causes problems mainly when two or more "example.com" strings are matched in a single line).
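To illustrate the issue, here is a quick check in PHP. The sample line is made up, and the `~` delimiters are only there so the slashes in the pattern need no escaping:

```php
<?php
// The sample line is made up for illustration.
$pattern = '~^(?:(?!//|/\*|\s\*).?).*example\.com~';
$line    = 'see example.com and example.com';

preg_match($pattern, $line, $m);

// The match runs from the start of the line to the last example.com,
// so it covers both occurrences at once.
var_dump($m[0] === $line);

// As a result, a replacement swallows everything up to the last occurrence.
echo preg_replace($pattern, 'example.net', $line), "\n";
```

What I am after is one match per "example.com", not one match per line.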
Can someone please help me fix my regex? Please note: It doesn't have to be a PHP regex, since I could always use a tool like grepWin to locally edit all the files at once.
Oh, and please let me know if there's a way to generalize block comments in some way, like this: once /* is found, do not match example.com until */ is found. That would be extremely useful. Is it possible to achieve this in general (non-language-dependent) regular expressions?
Upvotes: 4
Views: 374
Reputation: 655239
I would use some kind of tokenizer to tell comments and other language tokens apart.
As you’re processing PHP files, you should use PHP’s own tokenizer function token_get_all:
$tokens = token_get_all($source);
Then you can enumerate the tokens and separate the tokens by their type:
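foreach ($tokens as &$token) {
    if (is_string($token)) {
        // single-character token such as ";" or "{": leave it alone
        continue;
    }
    // T_ML_COMMENT is not defined in PHP 5 and later; T_COMMENT covers /* */ there
    if (in_array($token[0], array(T_COMMENT, T_DOC_COMMENT), true)) {
        // comment: keep the text unchanged
        $token = $token[1];
    } else {
        // not a comment: rewrite the domain
        $token = str_replace('example.com', 'example.net', $token[1]);
    }
}
unset($token); // break the reference left behind by foreach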
At the end, put everything back together with implode.
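Put together, the approach could look like the following minimal sketch. The function name `replace_outside_comments` and the sample source are made up for illustration. It assumes PHP 7 or later, so `T_ML_COMMENT` is left out (it is not defined there), and single-character tokens, which `token_get_all` returns as plain strings, are copied through unchanged:

```php
<?php
// Hypothetical helper: replace $old with $new everywhere except inside PHP comments.
function replace_outside_comments(string $source, string $old, string $new): string
{
    $out = '';
    foreach (token_get_all($source) as $token) {
        if (is_string($token)) {
            // Single-character tokens such as ";" or "{" arrive as plain strings.
            $out .= $token;
        } elseif ($token[0] === T_COMMENT || $token[0] === T_DOC_COMMENT) {
            // Line, block and doc comments are copied through untouched.
            $out .= $token[1];
        } else {
            // Every other token (strings, inline HTML, identifiers, ...) is rewritten.
            $out .= str_replace($old, $new, $token[1]);
        }
    }
    return $out;
}

// Made-up sample input.
$source = <<<'SRC'
<?php
/* docs live at example.com */
$faq = 'http://example.com/faq'; // see example.com
SRC;

echo replace_outside_comments($source, 'example.com', 'example.net'), "\n";
```

With this sample, the string literal is rewritten while both comments keep the old domain.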
For other languages, where you don’t have a proper tokenizer at hand, you can write your own little tokenizer:
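// The s flag lets /* ... */ span several lines; line comments stop at the newline.
preg_match_all('~/\*.*?\*/|//[^\n]*|(example\.com)|.~s', $code, $tokens, PREG_SET_ORDER);
foreach ($tokens as &$token) {
    if (!empty($token[1])) {
        // group 1 matched: example.com outside a comment
        $token = str_replace('example.com', 'example.net', $token[1]);
    } else {
        // a comment or any other single character: keep it as-is
        $token = $token[0];
    }
}
unset($token); // break the reference left behind by foreach
$code = implode('', $tokens);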
Note that this does not take any other tokens, such as strings, into account. So it won’t match example.com if it appears in a string but inside something that looks like a comment, as in:
'foo /* not a comment example.com */ bar'
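One way to soften that limitation is to let the little tokenizer recognise quoted strings as tokens of their own, so that comment markers inside a string are treated as plain text. The following is only a sketch under assumptions: the function name `replace_domain_string_aware` and the pattern are illustrative, it knows just single- and double-quoted literals with backslash escapes, and a stray quote outside a string or comment can still confuse it. Matching strings first also keeps the // in http:// inside a string from being read as the start of a line comment:

```php
<?php
// Hypothetical helper; the pattern is an illustrative extension, not a full lexer.
function replace_domain_string_aware(string $code): string
{
    $pattern = <<<'REGEX'
~
    /\*.*?\*/                        # block comment: kept as-is
  | //[^\n]*                         # line comment: kept as-is
  | ( ' (?: \\. | [^'\\] )* '        # group 1: single-quoted string literal
    | " (?: \\. | [^"\\] )* " )      #          or double-quoted string literal
  | ( example\.com )                 # group 2: occurrence in ordinary code
  | .                                # any other single character
~sx
REGEX;

    return preg_replace_callback($pattern, function (array $m): string {
        if (isset($m[1]) && $m[1] !== '') {
            // String literal: comment markers inside it are just text.
            return str_replace('example.com', 'example.net', $m[1]);
        }
        if (isset($m[2])) {
            return 'example.net';
        }
        return $m[0]; // comment or any other character
    }, $code);
}

// Made-up sample input.
$code = "\$a = 'foo /* not a comment example.com */ bar'; /* example.com */\n"
      . "\$b = 'http://example.com/faq'; // example.com";
echo replace_domain_string_aware($code), "\n";
```

Here both string literals are rewritten, while the real block comment and the line comment keep the old domain.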
Upvotes: 2
Reputation: 336148
A regex that only matches example.com if it's not inside a block comment (it does not handle line comments, so you'd have to deal with those separately):
$result = preg_replace(
    '%example\.com    # Match example.com
    (?!               # only if it\'s not possible to match
      (?:             # the following:
        (?!/\*)       # (unless an opening comment starts first)
        .             # any character
      )*              # any number of times
      \*/             # followed by a closing comment.
    )                 # End of lookahead
    %sx',
    'newdomain.com', $subject);
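As a quick way to try it out, the pattern can be wrapped in a small function. This is a sketch only: the function name `replace_outside_block_comments` and the sample text are made up, and the pattern is stored in a nowdoc so no quotes need escaping:

```php
<?php
// Hypothetical wrapper around the lookahead pattern above.
function replace_outside_block_comments(string $subject): string
{
    $pattern = <<<'REGEX'
%example\.com          # match example.com
(?!                    # unless what follows can be read as
    (?: (?!/\*) . )*   # characters that never open a new block comment
    \*/                # and then a closing comment marker
)
%sx
REGEX;

    return preg_replace($pattern, 'newdomain.com', $subject);
}

// Made-up sample input.
$sample = "see example.com\n"
        . "/* old docs:\n"
        . " * example.com\n"
        . " */\n"
        . "// also example.com";
echo replace_outside_block_comments($sample), "\n";
```

With this sample, the occurrence inside the /* ... */ block is left alone, while the first line and the one after // are both rewritten, which shows why line comments need separate handling.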
Upvotes: 2