Reputation: 699
String substitution on UTF-8 encoded strings works fine when the regexp contains only ASCII characters, but produces garbled output when the regexp contains non-ASCII characters.
my $str = "¿más?";
$str =~ s/[?]//g;
print "$str\n";
==> ¿más
$str =~ s/[¿]//g;
print "$str\n";
==> m�s
UPDATE: The answers above made it clear that my original question was framed poorly. The answers focused on STDOUT, but in my actual problem I am not printing to STDOUT (I only did that to simplify the problem statement). In the actual problem, I retrieve data from an SQLite store and use that data as filenames to search the file system. When I apply cleanup routines to the retrieved data, certain filenames get garbled.
One way to see this might be to simplify the example further:
my $str = "más";
$str =~ s/[?]//g;
print "$str\n";
==> más
$str =~ s/[¿]//g;
print "$str\n";
==> m�s
Now you can see that @ikegami's explanation does not apply. Something about the second s/// creates the problem. To be fair, both answers solved the problem as stated -- but any additional insights would be greatly appreciated!
UPDATE 2: As requested, have added sprintf's vector flag output. Note: Have also changed the target substitution character from ¿ to ¡ -- I now think that my code above (as @ikegami suggested) must have been copied incorrectly.
my $str = "más";
printf "%v02X\n", $str;
==> 6D.C3.A1.73
$str =~ s/[!]//g;
printf "%v02X\n", $str;
==> 6D.C3.A1.73
print "$str\n";
==> más
$str =~ s/[¡]//g;
printf "%v02X\n", $str;
==> 6D.C3.73
print "$str\n";
==> m�s
Upvotes: 2
Views: 245
Reputation: 385789
You are viewing your source code as if it were UTF-8, but unless you tell Perl it's UTF-8, it views it as US-ASCII, with each non-ASCII byte passed through as a separate character.
You say you have the following:
my $str = "más";
printf "%v02X %s\n", $str, $str;
$str =~ s/[!]//g;
printf "%v02X %s\n", $str, $str;
$str =~ s/[¡]//g;
printf "%v02X %s\n", $str, $str;
But you really gave the equivalent of the following to Perl:
my $str = "m\xC3\xA1s";
printf "%v02X %s\n", $str, $str; # 6D.C3.A1.73 (the UTF-8 of más)
$str =~ s/[!]//g;
printf "%v02X %s\n", $str, $str; # 6D.C3.A1.73 (the UTF-8 of más)
$str =~ s/[\xC2\xA1]//g; # Removes every occurrence of either of these bytes
printf "%v02X %s\n", $str, $str; # 6D.C3.73 (garbage)
You want the following:
use utf8; # Source is encoded using UTF-8
use open ':std', ':encoding(UTF-8)'; # Terminal provides and expects UTF-8.
my $str = "más";
printf "U+%v04X %s\n", $str, $str; # U+006D.00E1.0073 (the Unicode of más)
$str =~ s/[¡]//g; # Aka s/[\x{00A1}]//g
printf "U+%v04X %s\n", $str, $str; # U+006D.00E1.0073 (the Unicode of más)
You mention you didn't get your string from the source code and that you're not outputting to STDOUT, but the fix is the same: Decode inputs and encode outputs.
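For data that doesn't come from the source file, a minimal sketch of "decode inputs, encode outputs" using the core Encode module might look like the following. The byte string is a hypothetical stand-in for a value read from a database or directory listing, and it assumes the stored data really is UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical input: the UTF-8 bytes of "¡más", standing in for a value
# fetched from a database or file system (not taken from the question).
my $bytes = "\xC2\xA1m\xC3\xA1s";

# Decode: UTF-8 bytes in, Unicode characters out.
my $chars = decode('UTF-8', $bytes);

# The class now holds one character, U+00A1, not two separate bytes.
$chars =~ s/[\x{00A1}]//g;

# Encode: back to UTF-8 bytes before handing the value to the file system.
my $clean = encode('UTF-8', $chars);

printf "%v02X\n", $clean;   # 6D.C3.A1.73 (the UTF-8 of más)
```

Writing the character as \x{00A1} keeps the sketch independent of how the source file itself is encoded; with use utf8 in effect, a literal ¡ in the class would behave the same way.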
Upvotes: 3
Reputation: 35198
Specify the encoding of your source code using utf8 and the encoding of your output using binmode:
#!/usr/bin/env perl
use strict;
use warnings;
use utf8;
binmode STDOUT, ':encoding(UTF-8)';
my $str = "¿más?";
$str =~ s/[?]//g;
print "$str\n";
$str = "¿más?";
$str =~ s/[¿]//g;
print "$str\n";
Outputs:
¿más
más?
Upvotes: 3