Payel Senapati
Payel Senapati

Reputation: 1356

MySQL CHAR_LENGTH(str) sometimes produces different outputs for the same String literal when used with different character set introducers

This is weird. As per MySQL developer website -

MySQL CHAR_LENGTH(str)

Returns the length of the string str, measured in characters. A multibyte character counts as a single character. This means that for a string containing five 2-byte characters, LENGTH() returns 10, whereas CHAR_LENGTH() returns 5.

It clearly means that the output of CHAR_LENGTH(str) is independent of character set.

Now as given in -

Character String Literal Character Set and Collation

I can use introducer to SET CHARACTER SET of a String, as in -

SELECT 'abc';
SELECT _latin1'abc';
SELECT _binary'abc';
SELECT _utf8mb4'abc' COLLATE utf8mb4_danish_ci;

In my case, I take the String literal "Hello", use introducer to set it's CHARACTER SET, and use it as argument to the MySQL CHAR_LENGTH(str) function. But weirdly enough, it sometimes produces different outputs when different CHARACTER SETS are used. Example -

SELECT CHAR_LENGTH(_utf8mb4"Hello") AS character_length;
+------------------+
| character_length |
+------------------+
|                5 |
+------------------+

SELECT CHAR_LENGTH(_latin1"Hello") AS character_length;
+------------------+
| character_length |
+------------------+
|                5 |
+------------------+

SELECT CHAR_LENGTH(_ucs2"Hello") AS character_length;
+------------------+
| character_length |
+------------------+
|                3 |
+------------------+

Here, "Hello" has 5 characters. The CHARACTER SETS _utf8mb4 and _latin1 displays the number of characters correctly. But the CHARACTER SET _ucs2 oddly displays the number 0f characters to be 3.

How is this happening?

Upvotes: 1

Views: 235

Answers (2)

Rick James
Rick James

Reputation: 142208

Looks like a bug:

SELECT HEX(_ucs2"HELLO"), HEX(CONVERT("HELLO" USING ucs2));
+-------------------+----------------------------------+
| HEX(_ucs2"HELLO") | HEX(CONVERT("HELLO" USING ucs2)) |
+-------------------+----------------------------------+
| 0048454C4C4F      | 00480045004C004C004F             |
+-------------------+----------------------------------+

It seems that the _ucs2 introducer is applying to only the first character.

I created bug https://bugs.mysql.com/bug.php?id=105394&thanks=4

From bug report

The _charset_name expression is formally called an introducer. It tells the parser, “the string that follows uses character set charset_name.” An introducer does not change the string to the introducer character set like CONVERT() would do. It does not change the string value, although padding may occur. The introducer is just a signal.

Upvotes: 1

nbk
nbk

Reputation: 49373

That is because the language displyed, need 3 charcaters for that word.

To extend my answer.

A character conversion is in best cases difficult and should be avoided at all costs.

First MySQL tries to convert the $ Byte characte4r into 2 Byte characters that ucs2 has.

The resulting bytes are then used to display the characters, what you see at the end of the Snippet.

So in a conversion of character sets you don't have letters you have like everywhere in computer science bytes, which with the Representation of lets say H, but the bytes could in another character set have another representation. Further you need always some rules, how you can convert one character set to another, if the number of bytes differ.

So i also extended the example, to show you that a deterministic convertion actually happen, if you look at the bytes or binary representation you will find the concrete alghorithm,that is used.

SELECT CHAR_LENGTH(_ucs2"Hello") AS character_length;
| character_length |
| ---------------: |
|                3 |
SELECT _ucs2"Hello";
| 䡥汬  |
| :------ |
| H敬汯 |
SELECT _ucs2"Hel";
| 䡥  |
| :--- |
| H敬 |
SELECT _ucs2"Hell";
| 䡥汬 |
| :----- |
| 䡥汬 |
SELECT _ucs2"Hellos";
| 䡥汬潳 |
| :-------- |
| 䡥汬潳 |

db<>fiddle here

Upvotes: 1

Related Questions