What is collation at a low level?

Question

Let's say I have two tables:

CREATE TABLE `table_1` (
  `field` varchar(20) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

And

CREATE TABLE `table_2` (
  `field` varchar(20) COLLATE utf8_unicode_520_ci DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_520_ci;

The tables are the same except for their field collation. The first table uses (the default) utf_general_ci and the second table uses utf8_unicode_520_c1.

And I have inserted the value abcdé into both tables. I would assume (maybe incorrectly) that both of these characters would be stored with the binary value of: 61 62 63 C3A9. And both tables return that result when doing:

select field, hex(field) from table_1;
# abcé  616263C3A9

select field, hex(field) from table_2;
# abcé  616263C3A9

# Both return the same result

Does this mean the value is stored as that binary value or is there more to it as to how a utf-8 string is stored? Now, when I try doing a UNION or JOIN, it doesn't allow me to do so due to a mix of collations as it says:

select field, hex(field) from table_1 union
select field, hex(field) from table_2;

Illegal mix of collations for operation 'UNION'

Why does this occur then, and back to my initial question: what is the meaning of collation at the lower/storage level? I thought it was just used for a 'sort' (i.e., not at the storage level, but at the query/algorithmic level).

Rick James · Accepted Answer

You are interested in the effect of collation on JOIN and UNION?

First, a review of charset/collation... The C3A9 is the encoding, as indicated by the CHARACTER SET of utf8 or utf8mb4. The COLLATION say whether é > e or é < e or é = e. Or any other character. It's an algorithm, not an encoding. In English, the collation says whether or not A = a (case sensitivity) or not.

For JOIN...

FROM a
JOIN b  ON a.x = b.x

The Optimizers needs x in the two tables to have the same charset and collation. That way, INDEX(x) can be use to efficiently do the JOIN..ON.

For UNION DISTINCT ...

The column(s) that are involved in DISTINCTifying need to be compared. Comparing is efficient with an index. Without the same collation, there may be a definitional problem of what to do.

For UNION ALL ...

The result is a "table" with columns. Those columns have datatypes including collation. The simple approach is to demand all the collations be the same. More complex would be to convert on the fly.

The manual (on UNION) says (imprecisely) "For example, the first column selected by the first statement should have the same type as the first column selected by the other statements. If the data types of corresponding SELECT columns do not match, the types and lengths of the columns in the UNION result take into account the values retrieved by all the SELECT statements."

What is collation at a low level?

Answers (1)

Related Questions