dotancohen

Reputation: 31481

Perl search / replace with capture group on CLI

PHP's strlen() function is not UTF-8 aware, so I would like to swap each usage of strlen() with its UTF-8 aware counterpart: mb_strlen(). However, mb_strlen() requires an additional argument:

$length = strlen($someString);
$length = mb_strlen($someString, 'UTF-8');

Had there not been a second argument, a simple Perl regex would handle the swap:

$ find . -name '*' -print0 | xargs -0 perl -pi -e 's/strlen/mb_strlen/g'

I tried using capture groups and backreferences but the VIM-style syntax either does not support that (on a recent Ubuntu) or I cannot figure it out. I've tried several variations on this without success:

$ find . -name '*' -print0 | xargs -0 perl -pi -e 's/strlen\((\.*)\)/mb_strlen\($1, "UTF-8"\)/g'

Furthermore, there may be functions such as trim() inside strlen(), so I would have to make this greedy, but I'm not sure exactly where the greedy operator should go. How should this regex be written?

Upvotes: 0

Views: 2010

Answers (4)

Vijay

Reputation: 67231

find . -type f -print0 | xargs -0 perl -pi -e 's/strlen\(([^)]*)\)/mb_strlen($1, "UTF-8")/g'

Upvotes: 0

user1919238


Your problem is not solvable in the general case with a simple regex. Consider these examples:

if (strlen($var) > 0)

$total_length = strlen($thing1) + strlen($thing2);

strlen($var);   #Don't use trim() here because it was already trimmed.

some_other_function(strlen($foo) + 2);

None of these would work with your regex, because .* will greedily capture everything up to the last close parenthesis on the line. The only way to do this correctly is to check for balanced parentheses, which is non-trivial in a regex (it is technically possible with Perl's extended regex features, but it would be no easy task).

If you don't think you'll run into very many of the cases above, then just use one of the other suggested solutions and check for errors. Or you could do this to catch all of the simple cases that don't have any parentheses within them:

s/\bstrlen\(([^()]*)\)/mb_strlen($1, "UTF-8")/g

(Note: I also added \b to make sure the match starts at a word boundary. This stops you from double-replacing calls that are already mb_strlen.)

However, there is an easy quick hack solution that should work for all cases: create your own PHP function called my_mb_strlen, or whatever, that calls mb_strlen while adding the additional argument. Then you can perform a much simpler search and replace for the function name only, replacing strlen with my_mb_strlen.

Upvotes: 0

Richard Huxton

Reputation: 22893

This is more difficult than it first appears. You either need to:

  1. Parse the expression properly, including multi-line versions of the expression.
  2. Cheat

I'd go for cheat.

Most of the strlen() calls will be quite simple; the handful that are left can be replaced manually. And you're doing this under some sort of version control, aren't you?

Simple: strlen("foo"), strlen($bar)

# Match simple quoted strings - no embedded quotes
s/strlen\((["'][^"']*["'])\)/mb_strlen($1, 'UTF-8')/g
# Match simple variables - no method calls etc
s/strlen\((\$\w+)\)/mb_strlen($1, 'UTF-8')/g

Handling array-variables, function and method calls and other expressions gets more complicated, but see how many are left after these two basic replacements.

Upvotes: 1

Zaid

Reputation: 37146

By specifying \.*, the regex will match 0 or more literal '.'s.

Try it after omitting the \:

s/strlen\((.*)\)/mb_strlen($1, "UTF-8")/g
           ^              ^           ^
           NO BACKSLASH   NO BACKSLASH NEEDED
                          AS THIS IS TREATED AS
                          A STRING AND NOT A REGEX

Also, test it without the -i flag first to make sure you're happy with the substitution; otherwise your files will be modified in place.

Upvotes: 0
