Reputation: 31481
PHP's strlen() function is not UTF-8 aware, so I would like to swap each usage of strlen() with its UTF-8 aware counterpart: mb_strlen(). However, mb_strlen() requires an additional argument:
$length = strlen($someString);
$length = mb_strlen($someString, 'UTF-8');
Had there not been a second argument, a simple Perl regex would handle the swap:
$ find . -name '*' -print0 | xargs -0 perl -pi -e 's/strlen/mb_strlen/g'
I tried using capture groups and backreferences but the VIM-style syntax either does not support that (on a recent Ubuntu) or I cannot figure it out. I've tried several variations on this without success:
$ find . -name '*' -print0 | xargs -0 perl -pi -e 's/strlen\((\.*)\)/mb_strlen\($1, "UTF-8"\)/g'
Furthermore, there may be functions such as trim() inside strlen(), so I would have to make this greedy, but I'm not sure where the greedy operator should go exactly. How should this regex be written?
Upvotes: 0
Views: 2010
Reputation: 67231
find . -type f | xargs perl -pi -e 's/strlen\(([^\)]*)\)/mb_strlen($1, "UTF-8")/g'
Upvotes: 0
Reputation:
Your problem is not solvable in the general case with a simple regex. Consider these examples:
if (strlen($var) > 0)
$total_length = strlen($thing1) + strlen($thing2);
strlen($var); #Don't use trim() here because it was already trimmed.
some_other_function(strlen($foo) + 2);
None of these would work with your regex, because .* will greedily capture everything up until the last close parenthesis in the line. The only way to do this correctly is to check for balanced parentheses, which is non-trivial in a regex (though it is technically possible with Perl's extended regex features, it would be no easy task).
If you don't think you'll run into very many of the cases above, then just use one of the other suggested solutions and check for errors. Or you could do this to catch all of the simple cases that don't have any parentheses within them:
s/\bstrlen\(([^()]*)\)/mb_strlen($1, "UTF-8")/g
(Note, I also added \b to make sure it starts at a word boundary. This will stop you from double-replacing things that are already mb_strlen.)
However, there is an easy quick hack solution that should work for all cases: create your own PHP function called my_mb_strlen, or whatever, that calls mb_strlen while adding the additional argument. Then you can perform a much simpler search and replace for the function name only, replacing strlen with my_mb_strlen.
Upvotes: 0
Reputation: 22893
This is more difficult than it first appears. You either need to parse the PHP properly or cheat.
I'd go for cheat.
Most of the strlen() calls will be quite simple; the handful that are left can be manually replaced. And you're doing this under some sort of version control, aren't you?
Simple: strlen("foo"), strlen($bar)
# Match simple quoted strings - no embedded quotes
s/strlen\((["'][^"']*["'])\)/mb_strlen($1, 'UTF-8')/g
# Match simple variables - no method calls etc
s/strlen\((\$\w+)\)/mb_strlen($1, 'UTF-8')/g
Handling array-variables, function and method calls and other expressions gets more complicated, but see how many are left after these two basic replacements.
Upvotes: 1
Reputation: 37146
By specifying \.*, the regex will match 0 or more literal '.'s. Try it after omitting the \:
s/strlen\((.*)\)/mb_strlen($1, "UTF-8")/g
           ^              ^           ^
     NO BACKSLASH         NO BACKSLASH NEEDED
                          AS THIS IS TREATED AS
                          A STRING AND NOT A REGEX
Also, try testing it without the -i flag first to make sure you're happy with the substitution, else your files will be modified in place.
Upvotes: 0