How do I determine if this is latin1 or utf8?

Question

I have the string "Artîsté" in a latin1 table. I use the MySQL C connector to read the string from the table. I have character_set_connection set to utf8.

In the debugger it looks like:

"Art\xeest\xe9"

If I print the hex value of each char with printf ("%02X", (unsigned char) a[i]); I get:

41 72 74 EE 73 74 E9

How do I know if it is utf8 or latin1?

Ivan Buttinoni · Accepted Answer

As you can see in the schema of a UTF-8 sequence, there are two main possibilities:

First bit = 0 (same as ASCII): 1 byte per char, with value <= 0x7F.
First bit = 1: the byte belongs to a multi-byte UTF-8 sequence. The sequence is at least 2 bytes long, and every byte in it has a value >= 0x80.

Your string is ISO-8859 (latin1) encoded:

41 72 74 *EE* 73 74 *E9*

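There are only 2 stand-alone bytes with values >= 0x80. Each is followed by an ASCII byte or by the end of the string, so neither can be part of a multi-byte UTF-8 sequence.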
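The check described above can be done in code. The following is a minimal sketch, not a definitive implementation: the function name is_valid_utf8 is made up for this example, and the byte ranges follow the usual definition of well-formed UTF-8 (no overlong forms, no surrogates, nothing above U+10FFFF).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: returns true if the len bytes at s are well-formed UTF-8. */
bool is_valid_utf8(const unsigned char *s, size_t len)
{
    assert(s != NULL || len == 0);

    size_t i = 0;
    while (i < len) {
        unsigned char b = s[i];
        size_t n;                           /* continuation bytes expected */
        unsigned char lo = 0x80, hi = 0xBF; /* allowed range of the first one */

        if (b <= 0x7F) {                    /* first bit 0: plain ASCII */
            i++;
            continue;
        } else if (b >= 0xC2 && b <= 0xDF) {
            n = 1;
        } else if (b >= 0xE0 && b <= 0xEF) {
            n = 2;
            if (b == 0xE0) lo = 0xA0;       /* reject overlong forms */
            if (b == 0xED) hi = 0x9F;       /* reject UTF-16 surrogates */
        } else if (b >= 0xF0 && b <= 0xF4) {
            n = 3;
            if (b == 0xF0) lo = 0x90;       /* reject overlong forms */
            if (b == 0xF4) hi = 0x8F;       /* reject values above U+10FFFF */
        } else {
            return false;                   /* 0x80..0xC1 and 0xF5..0xFF never start a sequence */
        }

        if (len - i <= n)                   /* sequence is cut off */
            return false;
        if (s[i + 1] < lo || s[i + 1] > hi)
            return false;
        for (size_t k = 2; k <= n; k++)     /* remaining bytes must be 10xxxxxx */
            if ((s[i + k] & 0xC0) != 0x80)
                return false;

        i += n + 1;
    }
    return true;
}
```

For the bytes in the question, 41 72 74 EE 73 74 E9, the function returns false: EE announces a 3-byte sequence, but the next byte is 73, which is ASCII. The UTF-8 encoding of the same string, 41 72 74 C3 AE 73 74 C3 A9, returns true.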

ADDENDUM: BEWARE

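Be careful! Even if you find a well-formed UTF-8 sequence, you cannot tell it apart from a bunch of ISO-8859 chars, because the same bytes can also be read as ISO-8859. For example, the UTF-8 bytes C3 A9 for "é" read as "Ã©" in ISO-8859-1.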
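The ambiguity is easy to see in code. Every byte maps to some ISO-8859-1 character, so a Latin-1 to UTF-8 conversion never fails, whatever the input really was. Below is a minimal sketch, assuming strict ISO-8859-1 where each byte value equals its Unicode code point. MySQL's latin1 is based on cp1252, so bytes 0x80 to 0x9F would need a lookup table there. The name latin1_to_utf8 is made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: converts len ISO-8859-1 bytes at in to UTF-8.
   out must have room for 2 * len bytes. Returns the number of bytes written. */
size_t latin1_to_utf8(const unsigned char *in, size_t len, unsigned char *out)
{
    assert((in != NULL && out != NULL) || len == 0);

    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        unsigned char b = in[i];
        if (b < 0x80) {
            out[n++] = b;                                   /* ASCII is unchanged */
        } else {
            out[n++] = (unsigned char)(0xC0 | (b >> 6));    /* lead byte: 110000xx */
            out[n++] = (unsigned char)(0x80 | (b & 0x3F));  /* continuation: 10xxxxxx */
        }
    }
    return n;
}
```

Given the bytes from the question (41 72 74 EE 73 74 E9), this writes 41 72 74 C3 AE 73 74 C3 A9, the UTF-8 form of "Artîsté". Given bytes that were already UTF-8, such as C3 A9, it writes C3 83 C2 A9 without complaint, which displays as "Ã©". That is why the source encoding has to be known or guessed before converting.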
