Unicode-aware word boundary in Perl

Question

I'm on perl-5.24 and I stumbled upon \b not being Unicode-aware:

$ echo '""test"" ""тест""' | perl -pe 's/""\b/“/g'
“test"" ""тест""

whereas I expected it to be “test"" “тест"".

Then I learned about the Unicode extensions added to regexes in perl-5.22.1, in particular \b{wb}. But with this extension I still get the wrong result:

$ echo '""test"" ""тест""' | perl -pe 's/""\b{wb}/“/g'
“test“ “тест“

whereas I expected it to be “test"" “тест"".

My question is: how do I transform ""test"" ""тест"" to “test"" “тест"" via a Perl regex?

ikegami · Accepted Answer

You told s/// to match against the following:

22.22.74.65.73.74.22.22.20.22.22.D1.82.D0.B5.D1.81.D1.82.22.22.A

s/// (or more specifically, \b) expects Unicode Code Points, not UTF-8 bytes, so each byte above is treated as a separate character:

""test"" ""Ñ<82>ÐµÑ<81>Ñ<82>""

That's obviously not what you want the string to be.

Similarly, you claim your code contains the following:

s/""\b/“/g

Perl expects the script to be encoded using ASCII unless you encode it using UTF-8 and add use utf8; to tell Perl so.

Decode inputs. Encode outputs.

$ echo '""test"" ""тест""' | perl -pe'
    use utf8;
    use open ":std", ":encoding(UTF-8)";
    s/""\b/“/g
'
“test"" “тест""

or

$ echo '""test"" ""тест""' | perl -CSDA -Mutf8 -pe's/""\b/“/g'
“test"" “тест""
