x-yuri
x-yuri

Reputation: 18963

mysql doesn't support all symbols from range U+0000..U+FFFF

Consider the following table:

CREATE TABLE t1 (f1 VARCHAR(255));

Then, be it ruby:

#!/usr/bin/env ruby
require 'json'
require 'sequel'
require 'mysql2'
DB = Sequel.connect(
    :adapter => 'mysql2',
    :database => 'd1',
    :user => '<user>',
    :password => '<password>',
    :encoding => 'utf8')
v1 = '{"a":"b\ud83c\udf4ec"}'
v2 = JSON.parse(v1)
p v2['a']
DB[:t1].truncate
DB[:t1].insert(f1: v2['a']);
p DB[:t1].first[:f1]

or php:

#!/usr/bin/env php
<?php
$dbh = new PDO('mysql:dbname=d1', '<user>', '<password>', [
    PDO::MYSQL_ATTR_INIT_COMMAND => 'SET NAMES utf8',
    PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
]);
$dbh->exec('TRUNCATE TABLE t1');
$v1 = '{"a":"b\ud83c\udf4ec"}';
$v2 = json_decode($v1);
var_dump($v2->a);
$sth = $dbh->prepare("INSERT INTO t1 VALUES (?)");
$sth->execute([$v2->a]);
$sth = $dbh->query("SELECT * FROM t1");
var_dump($sth->fetch()['f1']);

what gets in the database is b. I'm running mysql-5.1 and the documentation says:

MySQL 5.1 supports two character sets for storing Unicode data:

  • ucs2, the UCS-2 encoding of the Unicode character set using 16 bits per character.

  • utf8, a UTF-8 encoding of the Unicode character set using one to three bytes per character.

These two character sets support the characters from the Basic Multilingual Plane (BMP) of Unicode Version 3.0. BMP characters have these characteristics:

  • Their code values are between 0 and 65535 (or U+0000 .. U+FFFF).

What am I doing wrong?

UPD

$ mysql -BNe 'SHOW CREATE TABLE t1' d1
t1  CREATE TABLE `t1` (\n  `f1` varchar(255) DEFAULT NULL\n) ENGINE=InnoDB DEFAULT CHARSET=utf8

Upvotes: 0

Views: 482

Answers (2)

x-yuri
x-yuri

Reputation: 18963

It appears those two escape sequences represent only one character: RED APPLE (U+1F34E). The first one being a surrogate. And surrogates are:

The UCS uses surrogates to address characters outside the initial Basic Multilingual Plane without resorting to more than 16 bit byte representations.

So that must be it, the resulting character is outside the BMP. And is not supported by mysql's utf8 character set as such.

Upvotes: 1

Benjamin Sonntag
Benjamin Sonntag

Reputation: 596

in my MySQL 5.1 (from debian) doing

CREATE TABLE t1 (f1 VARCHAR(255));

is effectively creating a LATIN1 table :

mysql> show CREATE TABLE t1 ;
+-------+---------------------------------------------------------------------------------------------+
| Table | Create Table                                                                                |
+-------+---------------------------------------------------------------------------------------------+
| t1    | CREATE TABLE `t1` (
  `f1` varchar(255) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1 |
+-------+---------------------------------------------------------------------------------------------+

So please check first that your MySQL really defaults to UTF-8.

Then, MySQL is known to NOT be able to store every character from BMP table. I don't find references about that, but saw it earlier.

So much that from mysql 5.5.3 introduced a new utf8mb4 full-unicode support character set as statu As stated here : https://dev.mysql.com/doc/refman/5.5/en/charset-unicode-upgrading.html

Finally, even if BMP is saying they are between 0 and 0xFFFF it doesn't mean they are using all of this space as stated here : https://en.wikipedia.org/wiki/Plane_%28Unicode%29#Basic_Multilingual_Plane

Upvotes: 1

Related Questions