Timothy B.
Timothy B.

Reputation: 655

Unicode support (including emoji) in web application with accent insensitive collation

I have a legacy web application using Perl, CGI, and MySQL 5.5.62. In fields completed by customers, I need to support the umlaut and emoji characters that they often use in their input.

In an attempt to learn, I set up the following stand-alone test. (It's deliberately very simple and lacking basic security checks on input.)

Dump of database widget:

DROP TABLE IF EXISTS `experiment`;
CREATE TABLE `experiment` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `name` varchar(40) CHARACTER SET utf8mb4 DEFAULT NULL,
  `content` text CHARACTER SET utf8mb4,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=2 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;

LOCK TABLES `experiment` WRITE;
INSERT INTO `experiment` VALUES (1,'Record','Now is the time for all good men to come to the aid of their country. 😀\r\nThe quick brown fox jumped over the lazy dög.');
UNLOCK TABLES;

Perl code:

#!/usr/bin/perl -T

use strict;
use warnings;

use DBI;
use CGI '-utf8';

my $dbh = DBI->connect('DBI:mysql:widget','test','test', { mysql_enable_utf8 => 0,}) or die "Can't connect to the database: $DBI::errstr";
my $sth = $dbh->prepare('SELECT * FROM `experiment`') or die "Couldn't prepare statement: " . $dbh->errstr;
$sth->execute or die "Can't execute SQL statement: $DBI::errstr";
my $hashref = $sth->fetchrow_hashref or die "Can't fetchrow_hashref: $DBI::errstr\n";
$sth->finish;
my $search = '';
for my $i (qw(fox dog)) {
    $sth = $dbh->prepare("SELECT * FROM `experiment` WHERE `content` LIKE '%$i%'") or die "Couldn't prepare statement: " . $dbh->errstr;
    my $count = $sth->execute or die "Can't execute SQL statement: $DBI::errstr";
    $search .= "<h6>String: [$i] found [$count]</h6>";
}
$sth->finish;

my $action = CGI::param('action') || '';
if ($action eq 'save') {
    my $new = CGI::param('value') || '';
    $sth = $dbh->prepare("UPDATE `experiment` SET `content` = '$new' WHERE `id` = 1") or die "Couldn't prepare statement: " . $dbh->errstr;
    $sth->execute or die "Can't execute SQL statement: $DBI::errstr";
    $sth->finish;
    print "Location: http://simulated-domain-name.com/cgi-bin/test.cgi\n\n";
    exit;
}
$dbh->disconnect;
print <<EOF;
Content-type: text/html

<!DOCTYPE html>
<html lang="en">
    <head>
        <meta charset="utf-8">
        <link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.3.1/css/bootstrap.min.css" integrity="sha384-ggOyR0iXCbMQv3Xipma34MD+dH/1fQ784/j6cY/iJTQUOhcWr7x9JvoRxT2MZw1T" crossorigin="anonymous"> <!-- not part of the experiment, just make it look nice -->
    </head>
    <body>
        <div class="container my-3">
            <h5>Content = $hashref->{content}</h5>
$search
            <form method="post">
                <input type="hidden" name="action" value="save">
                <div class="form-group">
                    <label class="font-weight-bold" for="exampleFormControlTextarea1">Content</label>
                    <textarea name="value" class="form-control" id="exampleFormControlTextarea1" rows="3">$hashref->{content}</textarea>
                </div>
                <button type="submit" class="btn btn-primary">Submit</button>
            </form>
        </div>
    </body>
</html>
EOF
exit;

As far as I can tell, the '-utf8' on the line with use CGI has no effect.

When using mysql_enable_utf8 = 0 the program works beautifully but accent insensitive search fails. Umlauts and emoji are properly displayed in the browser.

Output:

Content = Now is the time for all good men to come to the aid of their country. 😀 The quick brown fox jumped over the lazy dög. String: [fox] found [1] String: [dog] found [0E0]

When using `mysql_enable_utf8 = 1' the HTML output looks scrambled.

Output:

Content = Now is the time for all good men to come to the aid of their country. 😀 The quick brown fox jumped over the lazy dög. String: [fox] found [1] String: [dog] found [0E0]

I feel like I'm close, but missing something important.

Upvotes: 1

Views: 509

Answers (2)

Timothy B.
Timothy B.

Reputation: 655

This is the solution that worked for me on CentOS 7:

  • Install mysql-devel (via yum) because mysql_config was not on my system
  • Upgrade Perl DBD::mysql from 4.023 (via yum) to 4.050 (via CPAN)
  • Use mysql_enable_utf8mb4 option in DBI->connect
  • Add binmode(STDOUT, ":encoding(UTF-8)"); to top of Perl script

Output is now as desired:

Content = Now is the time for all good men to come to the aid of their country. 🤣 The quick brown fox jumped over the lazy dög. String: [fox] found [1] String: [dog] found [1]

Upvotes: 0

Rick James
Rick James

Reputation: 142503

Emoji --> CHARACTER SET utf8mb4.

Accent insensitive --> any COLLATION utf8mb4_..._ci.

Since you are on the relatively antique 5.5, you may run into the "767 problem". See http://mysql.rjweb.org/doc.php/limits#767_limit_in_innodb_indexes

If you have Question marks or Mojibake (such as dög for dög), see Trouble with UTF-8 characters; what I see is not what I stored

My notes on Perl:

use utf8;
use open ':std', ':encoding(UTF-8)';
my $dbh = DBI->connect("dbi:mysql:".$dsn, $user, $password, {
   PrintError => 0,
   RaiseError => 1,
   mysql_enable_utf8 => 1,  # Switch to UTF-8 for communication and decode.
});
# or {mysql_enable_utf8mb4 => 1} if using utf8mb4

(I don't have any notes on use CGI.)

Upvotes: 4

Related Questions