Phil Mitchell

Reputation: 699

Perl string substitution garbles Unicode string

String substitution on UTF-8 encoded strings works fine when the regexp contains only ASCII characters, but produces garbled output when the regexp contains non-ASCII characters.

my $str = "¿más?";

$str =~ s/[?]//g; 
print "$str\n";

==> ¿más

$str =~ s/[¿]//g; 
print "$str\n";

==> m�s

UPDATE: The answers above made it clear that my original question was framed poorly. The answers focused on STDOUT, but in my actual problem I am not printing to STDOUT. (I only did that to simplify the problem statement.) In the actual problem, I retrieve data from a SQLite store and use that data as filenames to search the file system. When I apply cleanup routines to the retrieved data, certain filenames get garbled.

One way to see this might be to simplify the example further:

my $str = "más";

$str =~ s/[?]//g; 
print "$str\n";

==> más

$str =~ s/[¿]//g; 
print "$str\n";

==> m�s

Now you can see that @ikegami's explanation does not apply. Something about the second s/// creates the problem. To be fair, both answers solved the problem as stated -- but any additional insights would be greatly appreciated!

UPDATE 2: As requested, have added sprintf's vector flag output. Note: Have also changed the target substitution character from ¿ to ¡ -- I now think that my code above (as @ikegami suggested) must have been copied incorrectly.

my $str = "más";
printf "%v02X\n", $str;

==> 6D.C3.A1.73

$str =~ s/[!]//g; 
printf "%v02X\n", $str;

==> 6D.C3.A1.73

print "$str\n";

==> más

$str =~ s/[¡]//g; 
printf "%v02X\n", $str;

==> 6D.C3.73

print "$str\n";

==> m�s

Upvotes: 2

Views: 245

Answers (2)

ikegami

Reputation: 385789

You are viewing your source code as if it were UTF-8, but unless you tell Perl it's UTF-8, it views it as US-ASCII and treats every non-ASCII byte as a separate character.

You say you have the following:

my $str = "más";
printf "%v02X %s\n", $str, $str;
$str =~ s/[!]//g; 
printf "%v02X %s\n", $str, $str;
$str =~ s/[¡]//g; 
printf "%v02X %s\n", $str, $str;

But you really gave the equivalent of the following to Perl:

my $str = "m\xC3\xA1s";
printf "%v02X %s\n", $str, $str;   # 6D.C3.A1.73 (the UTF-8 of más)
$str =~ s/[!]//g; 
printf "%v02X %s\n", $str, $str;   # 6D.C3.A1.73 (the UTF-8 of más)
$str =~ s/[\xC2\xA1]//g;           # Replaces either of these bytes
printf "%v02X %s\n", $str, $str;   # 6D.C3.73 (garbage)
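
The byte collision can be checked directly. The following is a minimal sketch using the core Encode module; the helper name utf8_hex is made up for this illustration. It shows that á (U+00E1) and ¡ (U+00A1) share the trailing byte A1 in UTF-8, which is why a byte-level character class built from ¡ also strips half of á.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: the UTF-8 bytes of a character string, as dotted hex.
sub utf8_hex {
    my ($chars) = @_;
    return join '.', map { sprintf '%02X', ord } split //, encode('UTF-8', $chars);
}

print utf8_hex("\x{E1}"), "\n";   # á is encoded as C3.A1
print utf8_hex("\x{A1}"), "\n";   # ¡ is encoded as C2.A1

# Without "use utf8", the class [¡] is really the two bytes C2 and A1.
my $bytes = encode('UTF-8', "m\x{E1}s");      # 6D.C3.A1.73
(my $damaged = $bytes) =~ s/[\xC2\xA1]//g;    # removes the A1 that belongs to á
printf "%v02X\n", $damaged;                   # 6D.C3.73
```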

You want the following:

use utf8;                             # Source is encoded using UTF-8
use open ':std', ':encoding(UTF-8)';  # Terminal provides and expects UTF-8.

my $str = "más";
printf "U+%v04X %s\n", $str, $str;   # U+006D.00E1.0073 (the Unicode of más)
$str =~ s/[¡]//g;                    # Aka s/[\x{00A1}]//g
printf "U+%v04X %s\n", $str, $str;   # U+006D.00E1.0073 (the Unicode of más)

You mention that you didn't get your string from the source code and that you're not outputting to STDOUT, but the fix is the same: decode inputs and encode outputs.
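
For the database-to-filename case, "decode inputs and encode outputs" might look like the sketch below. It assumes the database hands back raw UTF-8 bytes and that the file system expects UTF-8 names; the sub name clean_name and the sample value are hypothetical stand-ins. The punctuation is written as \x{...} escapes so the regexp does not depend on how the source file is saved.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical cleanup routine: takes raw UTF-8 bytes, returns raw UTF-8 bytes.
sub clean_name {
    my ($raw) = @_;
    my $name = decode('UTF-8', $raw);    # input: bytes -> characters
    $name =~ s/[\x{A1}\x{BF}!?]//g;      # strip ¡ ¿ ! ? as whole characters
    return encode('UTF-8', $name);       # output: characters -> bytes
}

my $from_db = "\xC2\xA1m\xC3\xA1s!";     # stand-in for a fetched value: the bytes of "¡más!"
my $file    = clean_name($from_db);
printf "%v02X\n", $file;                 # 6D.C3.A1.73 (the UTF-8 of más)
```

If the strings come from DBD::SQLite, its sqlite_unicode connection attribute can do the decoding step for you; whether that suits your setup is worth checking against the driver documentation.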

Upvotes: 3

Miller

Reputation: 35198

Specify the encoding of your source code with the utf8 pragma and the encoding of your output with binmode:

#!/usr/bin/env perl
use strict;
use warnings;
use utf8;

binmode STDOUT, ':encoding(UTF-8)';

my $str = "¿más?";

$str =~ s/[?]//g; 
print "$str\n";

$str = "¿más?";
$str =~ s/[¿]//g; 
print "$str\n";

Outputs:

¿más
más?

Upvotes: 3
