Reputation: 13477
I'm on perl-5.24 and I found that \b is not Unicode-aware:
$ echo '""test"" ""тест""' | perl -pe 's/""\b/“/g'
“test"" ""тест""
whereas I expected “test"" “тест"".
Then I learned about the Unicode extensions to regexes in perl-5.22.1, in particular \b{wb}. But with this extension I still get the wrong result:
$ echo '""test"" ""тест""' | perl -pe 's/""\b{wb}/“/g'
“test“ “тест“
whereas I expected “test"" “тест"".
My question is: how do I transform ""test"" ""тест"" to “test"" “тест"" with a Perl regex?
Upvotes: 0
Views: 183
Reputation: 385897
You told s/// to match against the following bytes:
22.22.74.65.73.74.22.22.20.22.22.D1.82.D0.B5.D1.81.D1.82.22.22.A
s/// (or more specifically, \b) expects Unicode code points, so the above is treated as
""test"" ""Ñ<82>еÑ<81>Ñ<82>""
That's obviously not what you want the string to be.
Similarly, you claim your code contains the following:
s/""\b/“/g
Perl expects the script to be encoded using ASCII. If the script is encoded using UTF-8, add use utf8; to let Perl know.
Decode inputs. Encode outputs.
$ echo '""test"" ""тест""' | perl -pe'
use utf8;
use open ":std", ":encoding(UTF-8)";
s/""\b/“/g
'
“test"" “тест""
or
$ echo '""test"" ""тест""' | perl -CSDA -Mutf8 -pe's/""\b/“/g'
“test"" “тест""
Upvotes: 7