Reputation: 159

Perl command line replace for unicode

I am trying to replace every word (stored in a tmp file called _id) with a number using a shell script. It works fine except for Unicode words: a number is generated for them, but the replacement using Perl does not happen. The bash code in question is below:

x=0
for id in `cat _id`; do
    echo $x $id
    perl -p -i -e "s/\b$id\b/$x/g" x_graph.dot
    x=$(($x + 1))
done

Can someone please point out where the bug is?

Upvotes: 4

Answers (3)

AnFi

Reputation: 10903

Add -Mutf8 (equivalent to use utf8;): this tells perl that the source code (the -e one-liner in your case) is written in UTF-8.
Add -CSDA: this makes perl treat the standard streams, the default PerlIO layers for input and output streams, and the @ARGV elements as UTF-8.

The following test produced the desired result under LANG=en_US.UTF-8:

echo "a ó b" > z.txt
id=ó
x=ń
perl -CD -Mutf8 -p -i -e "s/\b$id\b/$x/g" z.txt
cat z.txt

man perlrun

-C [number/list]
The -C flag controls some of the Perl Unicode features.
…
S 7 I + O + E [ STDIN is assumed to be in UTF-8, STDOUT and STDERR will be in UTF-8]
D 24 i + o [ UTF-8 is the default PerlIO layer for input and output streams]
A 32 the @ARGV elements are expected to be strings encoded in UTF-8

Upvotes: 3

ikegami

Reputation: 385635

Let's say you have é (U+00E9) encoded using UTF-8: C3 A9. Since you don't do any decoding, you obtain the string that's produced by "\xC3\xA9".

Regular expressions —or rather \b, \w, \d, etc— expect the input to be Unicode Code Points, which means you are effectively providing U+00C3 and U+00A9 instead of U+00E9. U+00C3 is a word character, but U+00A9 isn't, so the second \b doesn't match where it's expected to match.

So you need to decode your inputs and encode your outputs. -C provides a convenient way of doing that for UTF-8.

perl -i -CSDA -pe'
   BEGIN {
      ($id, $x) = splice(@ARGV, 0, 2);
      die "Bad id" if $id !~ /^\w(?:.*\w)?\z/s;
   }

   s/\b\Q$id\E\b/$x/g
' "$id" "$x" x_graph.dot

Notes:

By passing the values as command-line arguments instead of interpolating them into the Perl source, I fixed a code injection error.
The use of \b assumes that $id will always start with a \w char and always end with a \w char, so I added a check to verify that assumption.
By using \Q..\E to convert the id into a regex pattern, I fixed a second injection error: regex metacharacters in the id are now matched literally.

Test:

$ printf "é\n" >_id

$ printf "[é]\n" >x_graph.dot

$ x=0

$ id=`cat _id`

$ perl -i -CSDA -pe'
   BEGIN {
      ($id, $x) = splice(@ARGV, 0, 2);
      die "Bad id" if $id !~ /^\w(?:.*\w)?\z/s;
   }

   s/\b\Q$id\E\b/$x/g
' "$id" "$x" x_graph.dot

$ cat x_graph.dot
[0]

Upvotes: 4

Sinan Ünür

Reputation: 118118

See perldoc perlrun:

`-C` [number/list]

The -C flag controls some of the Perl Unicode features:

I     1   STDIN is assumed to be in UTF-8
O     2   STDOUT will be in UTF-8
E     4   STDERR will be in UTF-8
S     7   I + O + E
i     8   UTF-8 is the default PerlIO layer for input streams
o    16   UTF-8 is the default PerlIO layer for output streams
D    24   i + o
A    32   the @ARGV elements are expected to be strings encoded
          in UTF-8

So, at the very least, you'd want perl -COi, but perl -CSD looks tidier.

In addition, you may want to use the /u modifier (match according to Unicode rules) with your s///. Or, write:

perl -CSD -Mutf8 -Mfeature=unicode_strings -p -i -e "s/\b$id\b/$x/g" x_graph.dot

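Note that the double quotation marks are what let the shell substitute $id and $x into the Perl code. Single quotation marks would avoid unintended interpolation, but then the values have to reach perl some other way, for example through @ARGV.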

Upvotes: 3
