ItsPronounced
ItsPronounced

Reputation: 5463

MYSQL database encoding, mix of latin1 and utf-8

I have an older MYSQL 5.6.34 database that was created several years ago (not by me). I've taken one of the databases and started building tables in it for use in my applications when I noticed some of the other tables have a latin1 encoding. Then I noticed all the databases, with the exception of the information_schema all use latin1.

SCHEMA_NAME         DEFAULT_CHARACTER_SET_NAME  DEFAULT_COLLATION_NAME
information_schema  utf8                        utf8_general_ci
443347_pxa          latin1                      latin1_swedish_ci
443348_srp          latin1                      latin1_swedish_ci
warehouseinventory  latin1                      latin1_swedish_ci

I'm not sure why the original developer used latin1 for encoding and collation. There's no reason why really.

I'm don't really want to change any encoding on any existing databases, I just want to create new tables with utf8mb4. But the more I think about it, to stay consistent I'd maybe want to change them. The last database I created was warehouseinventory but I didn't notice the encoding, and I assume it defaulted to latin1 based on the encoding of the previously created databases.

Is there any reason why I'd leave these at latin1 and would it hurt to change them? One of the databases does have spanish language stored in it, but I always thought that was just UTF.

Upvotes: 1

Views: 5136

Answers (1)

Rick James
Rick James

Reputation: 142238

If you have any indexes on column(s) that are declared VARCHAR(255), you could get an error about "index bigger than 767 bytes" when converting to utf8mb4. There are 5 workarounds .

If you don't have any varchars between 191 and 255 characters that are indexed, then

ALTER TABLE tbl CONVERT TO utf8mb4;

will convert all columns in the table tbl to utf8mb4.

If you happen to have incorrectly stored utf8 bytes into latin1 in "double encoding", you need a different fix. Review Fixes for various Cases .

The "default" character set or collation is just that -- a default. That is, when a new column or table is created, it takes on the default. If you explicitly specify charset and/or collation for a column, that overrides the default for the table.

There's an old saying: "If it ain't broke, don't fix it." I lean toward making new databases/tables/columns utf8mb4, but leaving the old ones alone. Note: Having a mixture is OK. The connection specifies what encoding the client uses. MySQL, during INSERT or SELECT will convert from/to the client's settings to/from the column's settings.

Spanish has a limited set of characters. That set is a subset of latin1, utf8, and utf8mb4, so any of those charsets work with Spanish. However, the encoding is different. For example, ñ is

Hex F1   in latin1
Hex C3B1 in utf8 and utf8mb4

Similarly for the rest of Western Europe. Asia needs at least utf8. Emoji and some Chinese characters need utf8mb4.

If you end up with question marks, Mojibake, etc, debug it here .

Terminology: Outside MySQL: UTF-8; Inside MySQL: utf8mb4. Those are essentially the same. MySQL's utf8 is a subset of them.

Tést¥, in hex:

54 E9 73 74 A5 -- cp1256, dec8, latin1, latin5  encoding
54 C3A9 73 74 C2A5 -- utf8, utf8mb4  encoding
54 C383 C2A9 73 74 C382 C2A5 -- "double encoded"; may show as Tést¥

Upvotes: 3

Related Questions