Reputation: 2539

Perl string manipulation and utf8/unicode

I have what I thought was going to be a simple web form, until folks started copying and pasting text strings from Wikipedia that contain utf8 characters into an input field. My Perl CGI script opens a MySQL DB connection and sets

$DBH->{mysql_enable_utf8} = 1;
$DBH->do("set names 'utf8';");

I am trying to use the Encode module to decode, use and encode the target input value but that's not working as I expect. The web page is set with a utf8 character set.

My target string in this case is Baden-Württemberg [copied from a Wikipedia page that lists German town names]. When the request is sent I can see the target string as: Baden-W%C3%BCrttemberg. That is not flowing through my CGI script well though.

I have the following sample script:

#!/usr/local/bin/perl -w

use strict;
select(STDOUT);
$|++;

use feature 'unicode_strings';
use Encode;
use utf8;

binmode STDOUT, ":utf8";

my $thing = "Baden-Württemberg";
print STDOUT "$thing\n";

my $decodedThing = decode_utf8($thing);
print STDOUT encode_utf8($decodedThing) . "\n";

That value of $thing has a 'u' with an umlaut over it just after the '-W'.

When I run the script I get:

# ./test.pl
Malformed UTF-8 character (unexpected non-continuation byte 0x72, immediately after start byte 0xfc) at ./test.pl line 13.
Baden-Wrttemberg
Baden-Wrttemberg

Where did the u-umlaut go? How do I get it back?

Upvotes: 5

Answers (3)

7 Reeds

Reputation: 2539

It turns out that Rick James' last line, "Bottom line: You are not utf8 throughout the processing (bytes in hand, SET NAMES, CHARACTER SET, etc).", was the key. I do need the Encode module, but only really for the DB insert statements, a la:

if (!($sth->execute(encode('UTF-8', $_))) && $DBI::err != 1062) {
    die "DB execute failed :" . $DBI::err . ": " . $DBI::errstr;
}

Thanks to you all

Upvotes: 0

ikegami

Reputation: 385976

Problem 1

You told Perl your source file was encoded using UTF-8.

use utf8;

It wasn't. ü is represented by FC instead of C3 BC in your file. (That's why you are getting that "malformed" message.) Fix the encoding of your source file.

mv file.pl file.pl~ && piconv -f iso-8859-1 -t UTF-8 file.pl~ >file.pl

Problem 2

The following makes no sense:

my $decodedThing = decode_utf8($thing);

Because of use utf8;, $thing will already be decoded.

Problem 3

The following makes no sense:

print STDOUT encode_utf8($decodedThing);

You asked Perl to automatically encode everything sent to STDOUT, so you're double encoding.
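
To make the double encoding visible, here is a minimal sketch. The variable names are illustrative and not from the question, and the ü is written as the escape "\x{FC}" so the example does not depend on how the source file is saved. It encodes by hand twice, which is roughly what happens when already-encoded bytes are printed through an encoding layer.

```perl
use strict;
use warnings;
use Encode qw(encode);

# U+00FC written as an escape, so the file encoding does not matter.
my $chars = "Baden-W\x{FC}rttemberg";    # 17 characters

# First encoding: U+00FC becomes the two bytes C3 BC.
my $once = encode('UTF-8', $chars);

# Second encoding: each of those bytes is treated as a character
# (U+00C3, U+00BC) and encoded again, giving C3 83 C2 BC.
my $twice = encode('UTF-8', $once);

printf "%d %d %d\n", length $chars, length $once, length $twice;   # 17 18 20
printf "%vX\n", substr($twice, 7, 4);                              # C3.83.C2.BC
```

Viewed as UTF-8, the bytes C3 83 C2 BC display as two characters (Ã¼) instead of one ü.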

Fixed

#!/usr/local/bin/perl

use strict;
use warnings;
use utf8;
use open ':std', ':encoding(UTF-8)';

my $thing = "Baden-Württemberg";
printf "U+%v04X\n", $thing;     # U+[...].0057.00FC.0072.[...]
print "$thing\n";               # Baden-Württemberg

Upvotes: 3

Rick James

Reputation: 142316

%C3%BC is the URL-encoded form of ü. You do not want that for MySQL, though you might want it when building a URL.
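
As a rough sketch of what happens to the request value, assuming the form data is UTF-8: percent-decoding yields bytes, and those bytes still have to be decoded before Perl sees one ü character. CGI libraries normally do the percent-decoding step themselves; it is spelled out by hand here only for illustration, and the variable names are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The value as it appears in the request.
my $raw = 'Baden-W%C3%BCrttemberg';

# Percent-decoding gives BYTES: %C3%BC becomes the two bytes C3 BC.
(my $bytes = $raw) =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;

# Decoding those bytes as UTF-8 gives CHARACTERS: C3 BC becomes U+00FC.
my $chars = decode('UTF-8', $bytes);

printf "%d bytes, %d characters\n", length $bytes, length $chars;  # 18 bytes, 17 characters
printf "U+%04X\n", ord substr($chars, 7, 1);                       # U+00FC
```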

Ã¼ happens when you store utf8 bytes as if they were latin1 into a latin1 column. Please provide SHOW CREATE TABLE.

I don't think you need encode/decode_utf8 for anything.

Malformed UTF-8 character (unexpected non-continuation byte 0x72, immediately after start byte 0xfc) at ./test.pl line 13.

indicates that you have hex FC (the latin1 encoding of ü), but you are treating the string as utf8 ("unexpected .."); 72 is the r that follows.
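
One way to check which of the two you have in hand is a strict decode. This is a sketch only: is_valid_utf8 is a hypothetical helper, not part of Encode, and the two byte strings are written with escapes to stand in for the latin1 and utf8 forms of "Wür".

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: true if the byte string decodes cleanly as UTF-8.
# FB_CROAK makes decode() die on malformed input; LEAVE_SRC keeps it
# from modifying the argument.
sub is_valid_utf8 {
    my ($bytes) = @_;
    my $chars = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC) };
    return defined($chars) ? 1 : 0;
}

my $latin1_bytes = "W\xFCr";        # ü as the single byte FC
my $utf8_bytes   = "W\xC3\xBCr";    # ü as the two bytes C3 BC

for my $bytes ($latin1_bytes, $utf8_bytes) {
    my $hex = join ' ', map { sprintf '%02X', ord } split //, $bytes;
    print "$hex: ", (is_valid_utf8($bytes) ? 'valid UTF-8' : 'not valid UTF-8'), "\n";
}
```

The first string is rejected because FC cannot start a valid UTF-8 sequence, which is the same complaint Perl makes about the source file.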

Bottom line: You are not utf8 throughout the processing (bytes in hand, SET NAMES, CHARACTER SET, etc).

Upvotes: 2