Reputation: 2539
I have what I thought was going to be a simple web form, until folks started copying and pasting text strings from Wikipedia that contain UTF-8 characters into an input field. My Perl CGI script opens a MySQL DB connection and sets
$DBH->{mysql_enable_utf8} = 1;
$DBH->do("set names 'utf8';");
I am trying to use the Encode module to decode, use, and encode the target input value, but that's not working as I expect. The web page is set with a UTF-8 character set.
My target string in this case is Baden-Württemberg [copied from a Wikipedia page that lists German town names]. When the request is sent I can see the target string as Baden-W%C3%BCrttemberg. That is not flowing through my CGI script well, though.
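For reference, here is a minimal sketch of what has to happen to that request value before it is a usable Perl string: the percent escapes become raw bytes, and those bytes are then decoded as UTF-8. The helper name decode_form_value is made up for illustration, and a CGI module would normally do the percent-decoding step for you.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not part of any module): turn a percent-encoded
# form value into a Perl character string.
sub decode_form_value {
    my ($raw) = @_;
    # Step 1: %C3%BC -> the two raw bytes C3 BC
    (my $bytes = $raw) =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    # Step 2: raw UTF-8 bytes -> characters; die on malformed input
    return decode('UTF-8', $bytes, Encode::FB_CROAK);
}

my $town = decode_form_value('Baden-W%C3%BCrttemberg');
printf "%d characters, U+%04X at index 7\n",
    length($town), ord(substr($town, 7, 1));
```

If the decode step is skipped, the string holds 18 bytes instead of 17 characters, which is where length and database mismatches tend to start.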
I have the following sample script:
#!/usr/local/bin/perl -w
use strict;
select(STDOUT);
$|++;
use feature 'unicode_strings';
use Encode;
use utf8;
binmode STDOUT, ":utf8";
my $thing = "Baden-Württemberg";
print STDOUT "$thing\n";
my $decodedThing = decode_utf8($thing);
print STDOUT encode_utf8($decodedThing) . "\n";
That value of $thing has a 'u' with an umlaut over it just after the '-W'.
When I run the script I get:
# ./test.pl
Malformed UTF-8 character (unexpected non-continuation byte 0x72, immediately after start byte 0xfc) at ./test.pl line 13.
Baden-Wrttemberg
Baden-Wrttemberg
Where did the u-umlaut go? How do I get it back?
Upvotes: 5
Views: 1894
Reputation: 2539
Turns out the last line of Rick James' answer, "Bottom line: You are not utf8 throughout the processing (bytes in hand, SET NAMES, CHARACTER SET, etc).", was the key. I do need the Encode module, but really only for the DB insert statements, a la:
if (!($sth->execute(encode('UTF-8', $_))) && $DBI::err != 1062) {
die "DB execute failed :" . $DBI::err . ": " . $DBI::errstr;
}
Thanks to you all
Upvotes: 0
Reputation: 385976
You told Perl your source file was encoded using UTF-8.
use utf8;
It wasn't. ü is represented by FC instead of C3 BC in your file. (That's why you are getting that "malformed" message.) Fix the encoding of your source file.
mv file.pl file.pl~ && piconv -f iso-8859-1 -t UTF-8 file.pl~ >file.pl
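If you are not sure which encoding a file or string is in, a strict decode will tell you. This is only a sketch: is_valid_utf8 is a made-up helper name, and the two byte strings stand in for the two possible versions of your source line.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: returns 1 if the byte string is well-formed UTF-8, else 0.
sub is_valid_utf8 {
    my ($bytes) = @_;
    # FB_CROAK makes decode() die on malformed input; eval catches that.
    return eval { decode('UTF-8', $bytes, Encode::FB_CROAK); 1 } ? 1 : 0;
}

my $latin1_bytes = "Baden-W\xFCrttemberg";      # ü stored as the single byte FC
my $utf8_bytes   = "Baden-W\xC3\xBCrttemberg";  # ü stored as the two bytes C3 BC

printf "latin1 copy valid UTF-8? %s\n", is_valid_utf8($latin1_bytes) ? "yes" : "no";
printf "utf8 copy valid UTF-8?   %s\n", is_valid_utf8($utf8_bytes)   ? "yes" : "no";
```

The FC version fails for the same reason as the warning in the question: FC is not followed by a continuation byte.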
The following makes no sense:
my $decodedThing = decode_utf8($thing);
Because of use utf8, $thing will already be decoded.
The following makes no sense:
print STDOUT encode_utf8($decodedThing);
You asked Perl to automatically encode everything sent to STDOUT, so you're double encoding.
#!/usr/local/bin/perl
use strict;
use warnings;
use utf8;
use open ':std', ':encoding(UTF-8)';
my $thing = "Baden-Württemberg";
printf "U+%v04X\n", $thing; # U+[...].0057.00FC.0072.[...]
print "$thing\n"; # Baden-Württemberg
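To see what double encoding does at the byte level, here is a small sketch that encodes ü once and then encodes the result again. hex_bytes is a made-up helper for display only; the second encode call plays the role of the :utf8 output layer running on top of your explicit encode_utf8.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical display helper: space-separated hex dump of a byte string.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%02X', ord($_)) } split //, $bytes;
}

my $chars = "\x{FC}";                  # one character, U+00FC (ü)
my $once  = encode('UTF-8', $chars);   # correct: bytes C3 BC
my $twice = encode('UTF-8', $once);    # each byte treated as a character and encoded again

print "once:  ", hex_bytes($once),  "\n";
print "twice: ", hex_bytes($twice), "\n";
```

Encode exactly once, either explicitly or with an output layer, not both.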
Upvotes: 3
Reputation: 142316
%C3%BC is the urlencode for ü. You do not want that for MySQL, though you might want it when building a URL.
Ã¼ (Mojibake for ü) happens when you store utf8 bytes as if they were latin1 into a latin1 column. Please provide SHOW CREATE TABLE.
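The latin1 mix-up can be reproduced without a database. In this sketch the same two bytes are decoded once as latin1 and once as UTF-8; the variable names are only illustrative, and the latin1 decode stands in for what a latin1 connection or column does with the bytes.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $utf8_bytes = "\xC3\xBC";                         # UTF-8 encoding of ü
my $misread    = decode('iso-8859-1', $utf8_bytes);  # each byte becomes its own character
my $correct    = decode('UTF-8', $utf8_bytes);       # the pair becomes one character

printf "as latin1: %s\n", join ' ', map { sprintf('U+%04X', ord($_)) } split //, $misread;
printf "as UTF-8:  %s\n", join ' ', map { sprintf('U+%04X', ord($_)) } split //, $correct;
```

U+00C3 and U+00BC are Ã and ¼, which is the two-character garbage people usually see in place of ü.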
I don't think you need encode/decode_utf8 for anything.
Malformed UTF-8 character (unexpected non-continuation byte 0x72, immediately after start byte 0xfc) at ./test.pl line 13.
indicates that you have hex FC (which is the latin1 hex for ü), but you are treating the string as utf8 ("unexpected .."). 72 is the r that follows.
Bottom line: You are not utf8 throughout the processing (bytes in hand, SET NAMES, CHARACTER SET, etc).
Upvotes: 2