Reputation: 7711
I have a string "Artîsté" in a latin1 table. I use the MySQL C connector to get the string out of the table. I have character_set_connection set to utf8.
In the debugger it looks like :
"Art\xeest\xe9"
If I print the hex values with printf ("%02X", (unsigned char) a[i]); for each char I get
41 72 74 EE 73 74 E9
How do I know if it is utf8 or latin1?
Upvotes: 1
Views: 2586
Reputation: 279225
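\x74\xee\x73 isn't a valid UTF-8 sequence, since in UTF-8 a byte with the top bit set never stands alone: every non-ASCII character is encoded as a lead byte followed by one to three continuation bytes, all of which have the top bit set. So of the two, it must be Latin-1.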
However, if you see bytes that are valid UTF-8 data, then it's not always possible to rule out that it might be Latin-1 that just so happens to also be valid UTF-8.
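As a rough sketch of that check in C, a validator along the lines of RFC 3629 could look like the following. The function name is_valid_utf8 is made up for illustration; it is not part of the MySQL client library.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: returns true if the n bytes at s are well-formed
   UTF-8 (no stray continuation bytes, no truncated or overlong
   sequences, no surrogates, nothing above U+10FFFF). */
static bool is_valid_utf8(const unsigned char *s, size_t n)
{
    size_t i = 0;
    while (i < n) {
        unsigned char c = s[i];
        size_t len;          /* total bytes in this sequence */
        unsigned long cp;    /* code point being decoded */
        unsigned long min;   /* smallest code point allowed for len */

        if (c < 0x80) { i++; continue; }                    /* ASCII */
        else if ((c & 0xE0) == 0xC0) { len = 2; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; min = 0x10000; }
        else return false;   /* stray continuation byte or invalid lead */

        if (n - i < len) return false;                      /* truncated */
        for (size_t k = 1; k < len; k++) {
            if ((s[i + k] & 0xC0) != 0x80) return false;    /* not 10xxxxxx */
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        if (cp < min) return false;                         /* overlong */
        if (cp > 0x10FFFF) return false;                    /* out of range */
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;     /* surrogate */
        i += len;
    }
    return true;
}
```

For the bytes in the question, is_valid_utf8((const unsigned char *)"Art\xEEst\xE9", 7) returns false: 0xEE announces a 3-byte sequence, but the next byte, 0x73, is not a continuation byte. A true result only tells you the data could be UTF-8, not that it is.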
Latin-1 does have some invalid bytes (the ASCII control characters 0x00-0x1F and the unused range 0x7F-0x9F), so there are some UTF-8 strings that you can be sure are not Latin-1. But in my experience it's common enough to see Windows CP1252 mislabelled as Latin-1 that rejecting all those code points is fairly futile, except in the case where you're converting from another charset to Latin-1 and want to be strict about what you output. CP1252 has a few unused bytes too, but not as many.
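If you do want the strict variant, for example when validating Latin-1 output you produced yourself, a minimal sketch of the range check might be the following. The name has_unassigned_latin1_byte is hypothetical, and note that the check also flags tab and newline, which sit in 0x00-0x1F.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if any byte falls in a range that strict
   ISO 8859-1 leaves unassigned, i.e. 0x00-0x1F or 0x7F-0x9F.
   Caveat: this includes common whitespace such as '\t' and '\n'. */
static bool has_unassigned_latin1_byte(const unsigned char *s, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (s[i] <= 0x1F || (s[i] >= 0x7F && s[i] <= 0x9F))
            return true;
    }
    return false;
}
```

The question's bytes (41 72 74 EE 73 74 E9) pass this check, while the UTF-8 encoding of the euro sign (E2 82 AC) fails it because of the 0x82 continuation byte. CP1252 text that uses curly quotes (0x93, 0x94) fails too, which is why the check is of limited use on input you don't control.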
Upvotes: 5
Reputation: 4145
As you can see from the layout of a UTF-8 sequence, there are 2 broad possibilities:
This is ISO-8859 encoding:
41 72 74 *EE* 73 74 *E9*
It contains only 2 stand-alone bytes with values >= 0x80, and in UTF-8 such a byte never appears on its own.
BEWARE
Be careful! Even if you find a well-formed UTF-8 sequence, you cannot tell it apart from a bunch of ISO-8859 characters that happen to form valid UTF-8!
Upvotes: 1