Reputation: 159
I am trying to replace every word (stored in a tmp file called _id) with a number using a shell script. It works fine except for Unicode words, for which a number is generated but the replacement using Perl does not work. The bash code in question is below:
x=0
for id in `cat _id`; do
echo $x $id
perl -p -i -e "s/\b$id\b/$x/g" x_graph.dot
x=$(($x + 1))
done
Can someone please point out where the bug is?
Upvotes: 4
Views: 828
Reputation: 10903
Add -Mutf8 (equivalent of use utf8;): This will enable UTF-8 in source code (the -e one-liner in your case).
Add -CSDA: This will make perl use UTF-8 as the default layer for input and output streams.
The following test produced the desired result under LANG=en_US.UTF-8:
echo "a ó b" > z.txt
id=ó
x=ń
perl -CD -Mutf8 -p -i -e "s/\b$id\b/$x/g" z.txt
cat z.txt
-C [number/list]
The -C flag controls some of the Perl Unicode features.
…
S 7 I + O + E [STDIN is assumed to be in UTF-8, STDOUT and STDERR will be in UTF-8]
D 24 i + o [UTF-8 is the default PerlIO layer for input and output streams]
A 32 the @ARGV elements are expected to be strings encoded in UTF-8
Upvotes: 3
Reputation: 385635
Let's say you have é (U+00E9) encoded using UTF-8: C3 A9. Since you don't do any decoding, you obtain the string that's produced by "\xC3\xA9".
Regular expressions (or rather \b, \w, \d, etc.) expect the input to be Unicode Code Points, which means you are effectively providing U+00C3 and U+00A9 instead of U+00E9. U+00C3 is a word character, but U+00A9 isn't, so the second \b doesn't match where it's expected to match.
So you need to decode your inputs and encode your outputs. -C provides a convenient way of doing that for UTF-8.
perl -i -CSDA -pe'
    BEGIN {
        ($id, $x) = splice(@ARGV, 0, 2);
        die "Bad id" if $id !~ /^\w(?:.*\w)?\z/s;
    }
    s/\b\Q$id\E\b/$x/g
' "$id" "$x" x_graph.dot
Notes:
By passing the values as command-line arguments instead of interpolating them into the Perl source, I fixed a code injection error.
The use of \b assumes that $id will always start with a \w char and always end with a \w char, so I added a check to verify that assumption.
By using \Q..\E to convert the id into a regex pattern, I fixed a regex injection error.
Test:
$ printf "é\n" >_id
$ printf "[é]\n" >x_graph.dot
$ x=0
$ id=`cat _id`
$ perl -i -CSDA -pe'
    BEGIN {
        ($id, $x) = splice(@ARGV, 0, 2);
        die "Bad id" if $id !~ /^\w(?:.*\w)?\z/s;
    }
    s/\b\Q$id\E\b/$x/g
' "$id" "$x" x_graph.dot
$ cat x_graph.dot
$ cat x_graph.dot
[0]
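Put back into the loop from the question, the same one-liner might be used as sketched below. The sample contents of _id and x_graph.dot and the temporary directory are assumptions made so the example is self-contained. Reading the ids with while read instead of `cat _id` is a substitution of mine that avoids word splitting and glob expansion.

```shell
# Hypothetical sample data standing in for the real files.
work=$(mktemp -d)
printf 'é\nfoo\n' > "$work/_id"
printf '[é] -> [foo]\n' > "$work/x_graph.dot"

x=0
while IFS= read -r id; do
    perl -i -CSDA -pe'
        BEGIN {
            ($id, $x) = splice(@ARGV, 0, 2);
            die "Bad id" if $id !~ /^\w(?:.*\w)?\z/s;
        }
        s/\b\Q$id\E\b/$x/g
    ' "$id" "$x" "$work/x_graph.dot"
    x=$((x + 1))
done < "$work/_id"

cat "$work/x_graph.dot"   # [0] -> [1]
```

Note that this still rewrites the file once per id, as the original loop does, so a number inserted in an earlier pass could be matched again if a later id happens to be that same number.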
Upvotes: 4
Reputation: 118118
See perldoc perlrun:
-C [number/list]
The -C flag controls some of the Perl Unicode features:
I 1 STDIN is assumed to be in UTF-8
O 2 STDOUT will be in UTF-8
E 4 STDERR will be in UTF-8
S 7 I + O + E
i 8 UTF-8 is the default PerlIO layer for input streams
o 16 UTF-8 is the default PerlIO layer for output streams
D 24 i + o
A 32 the @ARGV elements are expected to be strings encoded in UTF-8
So, at the very least, you'd want perl -COi, but perl -CSD looks tidier.
In addition, you may want to use the /u modifier (match according to Unicode rules) with your s///. Or, write:
perl -CSD -Mutf8 -Mfeature=unicode_strings -p -i -e "s/\b$id\b/$x/g" x_graph.dot
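Note that the double quotation marks here are what let the shell substitute $id and $x into the one-liner; inside single quotation marks Perl would see its own (empty) $id and $x instead. The price is unintended interpolation: any regex or Perl metacharacters in the values are interpreted as code.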
Upvotes: 3