Reputation: 110163
Let's say I have two tables:
CREATE TABLE `table_1` (
`field` varchar(20) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
And
CREATE TABLE `table_2` (
`field` varchar(20) COLLATE utf8_unicode_520_ci DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_520_ci;
The tables are the same except for their field collation. The first table uses (the default) utf_general_ci
and the second table uses utf8_unicode_520_c1
.
And I have inserted the value abcdé
into both tables. I would assume (maybe incorrectly) that both of these characters would be stored with the binary value of: 61
62
63
C3A9
. And both tables return that result when doing:
select field, hex(field) from table_1;
# abcé 616263C3A9
select field, hex(field) from table_2;
# abcé 616263C3A9
# Both return the same result
Does this mean the value is stored as that binary value or is there more to it as to how a utf-8
string is stored? Now, when I try doing a UNION
or JOIN
, it doesn't allow me to do so due to a mix of collations as it says:
select field, hex(field) from table_1 union
select field, hex(field) from table_2;
Illegal mix of collations for operation 'UNION'
Why does this occur then, and back to my initial question: what is the meaning of collation at the lower/storage level? I thought it was just used for a 'sort' (i.e., not at the storage level, but at the query/algorithmic level).
Upvotes: 0
Views: 123
Reputation: 142298
You are interested in the effect of collation on JOIN
and UNION
?
First, a review of charset/collation... The C3A9 is the encoding, as indicated by the CHARACTER SET of utf8 or utf8mb4. The COLLATION say whether é > e or é < e or é = e. Or any other character. It's an algorithm, not an encoding. In English, the collation says whether or not A = a (case sensitivity) or not.
For JOIN
...
FROM a
JOIN b ON a.x = b.x
The Optimizers needs x
in the two tables to have the same charset and collation. That way, INDEX(x)
can be use to efficiently do the JOIN..ON
.
For UNION DISTINCT
...
The column(s) that are involved in DISTINCTifying
need to be compared. Comparing is efficient with an index. Without the same collation, there may be a definitional problem of what to do.
For UNION ALL
...
The result is a "table" with columns. Those columns have datatypes including collation. The simple approach is to demand all the collations be the same. More complex would be to convert on the fly.
The manual (on UNION
) says (imprecisely) "For example, the first column selected by the first statement should have the same type as the first column selected by the other statements. If the data types of corresponding SELECT columns do not match, the types and lengths of the columns in the UNION result take into account the values retrieved by all the SELECT statements."
Upvotes: 1