Chuck Le Butt

MySQL: 1366 Incorrect string value: '\xE0' in utf8mb4 column

I'm getting errors when certain characters are added to a table, even though the column has the utf8mb4 character set. For example:

SQLSTATE[HY000]: General error: 1366 Incorrect string value: '\xE0' for column 'surname'

The data in question is: SEGAL‡ (note the double dagger)

Is this character beyond even 4-byte UTF-8, or is the collation causing the issue? Or is it something else?

Screenshot showing character set and collation of the column:

screenshot showing utf8mb4 character set

It's a Laravel 8 app and the MySQL connection is configured to the following:

'charset' => 'utf8mb4',
'collation' => 'utf8mb4_unicode_ci',

Looking at the CSV file in PHPStorm, non-ASCII characters are displayed as a placeholder character. I've tried explicitly setting the file encoding to UTF-8 in PHPStorm (with and without BOM).

If I open the CSV in Excel, the non-ASCII characters display correctly. Confused.

Update

Examining the CSV in a hex editor shows that a character like ä is stored as a single byte (8A). When this CSV is opened in Excel it correctly shows ä, but everything else shows a placeholder character.

I don't know what character encoding Excel is using, as this character would typically be stored as E4 in a single-byte encoding, or as C3 A4 in UTF-8.
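The lone 8A byte also explains the placeholder character: 8A cannot start a valid UTF-8 sequence, so any UTF-8-aware tool substitutes U+FFFD when displaying it. A minimal sketch, using a simulated byte string in place of the CSV contents:

```python
# Simulated fragment of the CSV: "SEGAL" followed by the lone 0x8A byte
# the hex editor reported where an accented character should be.
raw = b"SEGAL\x8a"

# 0x8A is a UTF-8 continuation byte and cannot begin a sequence, so
# strict decoding fails and lenient decoding substitutes U+FFFD.
print(raw.decode("utf-8", errors="replace"))  # SEGAL�
```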


Answers (1)

Rick James

Double dagger is hex E2 80 A1 (only 3 bytes) when encoded in CHARACTER SET utf8 or utf8mb4. It is also available in latin1 as hex 87.

ä (LATIN SMALL LETTER A WITH DIAERESIS) is hex C3 A4 in utf8/utf8mb4, or E4 in latin1.
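Those byte values can be verified quickly; a sketch in Python, using cp1252 as the stand-in for MySQL's latin1 (MySQL's latin1 is really cp1252, not ISO-8859-1, which is why ‡ is available in it at all):

```python
# Hex bytes for the two characters under discussion.
print("‡".encode("utf-8").hex())   # e280a1 -- 3 bytes, fits in utf8 (utf8mb3)
print("‡".encode("cp1252").hex())  # 87
print("ä".encode("utf-8").hex())   # c3a4
print("ä".encode("cp1252").hex())  # e4
```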

Please provide the output of SHOW VARIABLES LIKE 'char%';

See "black diamond" in Trouble with UTF-8 characters; what I see is not what I stored for the likely cause.

The CHARACTER SET is the "encoding" and is relevant; the COLLATION is how characters compare (e.g., case folding) and is not relevant here.

Nothing in the question hints at a BOM (http://mysql.rjweb.org/doc.php/charcoll#bom_byte_order_mark), the bytes EF BB BF found at the beginning of some UTF-8-encoded files.
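If a BOM were suspected, checking for one takes a single comparison; a sketch using a simulated byte string in place of a real file read:

```python
BOM = b"\xef\xbb\xbf"

# Simulated file contents; a real check would read the first three
# bytes of the CSV opened in binary mode.
sample = BOM + b"surname\nSEGAL\n"
print(sample.startswith(BOM))  # True
```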

Do not use ucs2 (UTF-16) unless you happen to have a file encoded that way. Even then, I would declare the tables as utf8, not ucs2.

More

I plugged 8AE0E8 into a handy script (one that does CONVERT(CONVERT(UNHEX('8ae0e8') USING %s) USING utf8mb4)) and got:

ascii, big5, binary, cp932, eucjpms, euckr,
    gb18030, gb2312, gbk, sjis, ujis,
    utf8mb3, utf8mb4   ---invalid in these charsets
                                     swe7   3  3 '???'
                                   cp1257   5  3 '?ąč'
                                   cp1256   5  3 '?àè'
                                  geostd8   5  3 '?ჭ?'
                                   latin7   6  3 'ąč'
                             dec8, latin5   6  3 'àè'
                                      hp8   6  3 'ÁÒ'
                                    macce   6  3 'äŗŤ'
                                    cp850   6  3 'èÓÞ'
                                  keybcs2   6  3 'ĹαΦ'
                                    cp852   6  3 'ŐÓŔ'
                                   latin2   6  3 'ŕč'
                                   latin1   6  3 'Šàè'
                                   cp1250   6  3 'Šŕč'
                                    greek   6  3 'ΰθ'
                                    cp866   6  3 'Крш'
                                   cp1251   6  3 'Љаи'
                                 armscii8   6  3 'ՈՌ'
                                   hebrew   6  3 'אט'
                             koi8r, koi8u   7  3 '┼ЮХ'
                                 macroman   7  3 'ä‡Ë'
                                   tis620   8  3 'เ่'

It looks like macroman is the charset. I think that is specific to Apple.
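The same brute-force idea can be sketched in Python using its codec names (mac_roman is Python's name for the Apple charset; the codec list below is a small sample, not exhaustive):

```python
# Decode the mystery bytes 8A E0 E8 with several single-byte codecs
# and see which one produces plausible text.
data = bytes.fromhex("8ae0e8")

for codec in ["cp1252", "cp1250", "cp850", "koi8_r", "mac_roman"]:
    print(f"{codec:>10}: {data.decode(codec)}")

# mac_roman yields 'ä‡Ë': ä matching the hex editor's 8A, and ‡ (E0)
# matching the byte rejected in the MySQL error message.
```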
