Satishakumar Awati
Satishakumar Awati

Reputation: 3796

MYSQL warning Invalid utf8 character string

I am trying to use below query to enforce UTF-8 encoding on a column data.

SELECT convert(binary convert(summary using latin1) using utf8) FROM blogs b;

The below warnings getting logged in mysql console

enter image description here

What are below character represent in the above warnings? 'AE2065','927320','927320','927320','F16120','C02070','AE2043','F16F6C','93426F','93426F','E4646F','E46E73','E46DE4','E47974','C02070','934355','934355'

and how can I find out corresponding actual character symbols for above list?

Upvotes: 1

Views: 5541

Answers (2)

Rick James
Rick James

Reputation: 142298

Well, given these source character sets, you get this text. For example, several treat AE as ®

CONVERT(CONVERT(UNHEX('AE2065') USING %s) USING utf8mb4):

cp1250, cp1251, cp1256, cp1257, geostd8, hebrew, latin1, latin5, latin7
                                            4  3 '® e'
                    cp850, cp852, keybcs2   4  3 '« e'
                                    cp866   4  3 'о e'
                                      hp8   4  3 'Û e'
                                   latin2   4  3 'Ž e'
                                    macce   4  3 'ģ e'
                                 macroman   4  3 'Æ e'
                                 armscii8   5  3 '… e'
                              cp932, sjis   5  3 'ョ e'
                             koi8r, koi8u   5  3 '╝ e'
                                   tis620   5  3 'ฎ e'

For the 92, a likely interpretation is

CONVERT(CONVERT(UNHEX('927320') USING %s) USING utf8mb4)

                              cp932, sjis   4  2 '痴 '
                                    euckr   4  2 '뭩 '
                                      gbk   4  2 '抯 '
                                    cp850   4  3 'Æs '
                                    cp852   4  3 'ĺs '
                                    cp866   4  3 'Тs '
                                  keybcs2   4  3 'Žs '
                          macce, macroman   4  3 'ís '
cp1250, cp1251, cp1256, cp1257, geostd8, latin1
                                            5  3 '’s '

So far, I see that latin1 is common. I'll try another:

CONVERT(CONVERT(UNHEX('F16120') USING %s) USING utf8mb4)

                                     big5   4  2 '鎙 '
                                    cp932   4  2 ' '
                                      gbk   4  2 '馻 '
                                 armscii8   4  3 'րa '
cp1250, cp1257, dec8, latin1, latin2, latin5, latin7
                                            4  3 'ña '
                                   cp1251   4  3 'сa '
                           cp850, keybcs2   4  3 '±a '
                                    cp852   4  3 '˝a '
                                    cp866   4  3 'ёa '
                                    greek   4  3 'ρa '
                                   hebrew   4  3 'סa '
                                      hp8   4  3 'þa '
                             koi8r, koi8u   4  3 'Яa '
                                    macce   4  3 'Ůa '
                                 macroman   4  3 'Òa '
                                   tis620   5  3 '๑a '

Again, latin1 seems likely.

So,... What you need to do is declare that the bytes that you are feeding to MySQL are encoded in latin1. Then they will be automatically converted to the declaration of the column, which should be utf8 (or utf8mb4 if you want Chinese and Emoji).

Since you seem to already have latin1 bytes in the table (in column summary), it is not a question of whether you have utf8 or not; you do not have utf8. And the column is not declared utf8.

So, what do you really want do do?? Perhaps you want to ALTER the table to change the character set to utf8? See

ALTER TABLE ... CONVERT TO ...

What you had was close:

CONVERT(BINARY CONVERT(UNHEX('AE2065') USING latin1) USING utf8) --> Warning
CONVERT(       CONVERT(UNHEX('AE2065') USING latin1) USING utf8) --> '® e'

But, then what will you do with it? If the column is latin1, and you stick the converted stuff back in, you will get "double encoding", making things worse.

Perhaps you also have utf8-encoded characters sitting in a latin1 column? That's a mess -- "double encoding".

See Trouble with UTF-8 characters; what I see is not what I stored then further elaborate on your Question.

Upvotes: 2

Subash
Subash

Reputation: 7266

I am guessing that your data is stored as latin-1 charset. You could convert it as following:

SELECT convert(cast(convert(summary using  latin1) as binary) using utf8) FROM blogs b;

Did you actually intend to store your data as latin-1? If you intend to store your summary as UTF-8 you could update your table as follows:

UPDATE blogs SET 
    summary=convert(cast(convert(summary using  latin1) as binary) using utf8)
WHERE 1

Upvotes: 0

Related Questions