joels

Reputation: 7711

How do I determine if this is latin1 or utf8?

I have the string "Artîsté" in a latin1 table. I use the MySQL C connector to read the string out of the table, and I have character_set_connection set to utf8.

In the debugger it looks like:

"Art\xeest\xe9"

If I print the hex values with printf("%02X", (unsigned char) a[i]); for each char, I get:

41 72 74 EE 73 74 E9

How do I know if it is utf8 or latin1?

Upvotes: 1

Views: 2586

Answers (2)

Steve Jessop

Reputation: 279225

\x74\xee\x73 isn't a valid UTF-8 sequence, since in UTF-8 a byte with the top bit set never appears on its own: every non-ASCII character is encoded as a run of two or more such bytes. So of the two, it must be Latin-1.
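To make that check mechanical, here is a minimal sketch of a well-formedness test. The function name is_valid_utf8 is made up for illustration and is not part of the MySQL connector. It follows the usual rules (lead byte announces the length, continuation bytes look like 10xxxxxx, no overlong forms, no surrogates, nothing above U+10FFFF), but treat it as a sketch rather than a vetted validator.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: returns true if buf[0..len) is well-formed UTF-8. */
bool is_valid_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char b = buf[i];
        size_t extra;        /* number of continuation bytes expected */
        unsigned long cp;    /* code point being decoded */
        unsigned long min;   /* smallest code point allowed for this length */

        if (b < 0x80) {                      /* 0xxxxxxx: plain ASCII */
            i++;
            continue;
        } else if ((b & 0xE0) == 0xC0) {     /* 110xxxxx: 2-byte sequence */
            extra = 1; cp = b & 0x1F; min = 0x80;
        } else if ((b & 0xF0) == 0xE0) {     /* 1110xxxx: 3-byte sequence */
            extra = 2; cp = b & 0x0F; min = 0x800;
        } else if ((b & 0xF8) == 0xF0) {     /* 11110xxx: 4-byte sequence */
            extra = 3; cp = b & 0x07; min = 0x10000;
        } else {
            return false;                    /* stray continuation or invalid lead */
        }

        if (len - i <= extra)
            return false;                    /* sequence is cut off */

        for (size_t k = 1; k <= extra; k++) {
            unsigned char c = buf[i + k];
            if ((c & 0xC0) != 0x80)
                return false;                /* not a 10xxxxxx continuation byte */
            cp = (cp << 6) | (c & 0x3F);
        }

        /* Reject overlong encodings, surrogates and out-of-range values. */
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;

        i += extra + 1;
    }
    return true;
}
```

With the bytes from the question (41 72 74 EE 73 74 E9) this should return false, because EE announces a 3-byte sequence and the following 73 is not a continuation byte. A true result only means the data could be UTF-8, not that it is.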

However, if you see bytes that are valid UTF-8 data, then it's not always possible to rule out that it might be Latin-1 that just so happens to also be valid UTF-8.

Latin-1 does have some invalid bytes (the ASCII control characters 0x00-0x1F and the unused range 0x7F-0x9F), so there are some UTF-8 strings that you can be sure are not Latin-1. But in my experience it's common enough to see Windows CP1252 mislabelled as Latin-1 that rejecting all those code points is fairly futile, except in the case where you're converting from another charset to Latin-1 and want to be strict about what you output. CP1252 has a few unused bytes too, but not as many.

Upvotes: 5

Ivan Buttinoni

Reputation: 4145

As you can see from the layout of a UTF-8 sequence, there are two broad cases:

  • 1st bit = 0 (same as ASCII): 1 byte per char, with value <= 0x7F
  • 1st bit = 1: the byte is part of a multi-byte UTF-8 sequence; the sequence length is >= 2 bytes, and every byte in it has a value >= 0x80
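The two cases above can be sketched as a small classifier that looks only at the bit pattern of one byte. The name utf8_seq_len is hypothetical, and the function checks the pattern only: it does not verify the bytes that follow, and it does not reject lead bytes that can never occur in valid UTF-8 (such as C0 or C1).

```c
/* Hypothetical helper: sequence length announced by a byte's bit pattern.
 * Returns 1 for ASCII, 2..4 for a lead byte, and 0 for a continuation
 * byte (10xxxxxx) or a pattern that UTF-8 never uses (11111xxx). */
int utf8_seq_len(unsigned char b)
{
    if (b < 0x80)           return 1;   /* 0xxxxxxx */
    if ((b & 0xE0) == 0xC0) return 2;   /* 110xxxxx */
    if ((b & 0xF0) == 0xE0) return 3;   /* 1110xxxx */
    if ((b & 0xF8) == 0xF0) return 4;   /* 11110xxx */
    return 0;                           /* continuation or invalid */
}
```

For the bytes in the question, EE is classified as the start of a 3-byte sequence, but the next byte (73) is classified as a 1-byte ASCII character, so the promised sequence never materialises.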

Your data is ISO-8859 encoded:

41 72 74 *EE* 73 74 *E9*

It contains only two standalone bytes with values >= 0x80 (EE and E9), each surrounded by ASCII bytes, which cannot happen in UTF-8.

ADDENDUM: BEWARE

Be careful! Even if you find a well-formed UTF-8 sequence, you cannot differentiate it from a bunch of ISO-8859 chars that happen to have the same byte values!
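A small sketch of that ambiguity, using a made-up helper utf8_char_count: the two bytes C3 A9 are "é" when read as UTF-8 (one character), but they are also two perfectly legal Latin-1 characters ("Ã" followed by "©"). Counting characters under each reading shows that both interpretations are self-consistent, so the bytes alone cannot settle it.

```c
#include <stddef.h>

/* Hypothetical helper: number of characters if buf is read as UTF-8.
 * Assumes the input is already well-formed; it simply counts every byte
 * that is not a 10xxxxxx continuation byte. Read as Latin-1, the
 * character count is just len, since every byte is one character. */
size_t utf8_char_count(const unsigned char *buf, size_t len)
{
    size_t count = 0;
    for (size_t i = 0; i < len; i++) {
        if ((buf[i] & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For the buffer { 0xC3, 0xA9 } this gives 1 character as UTF-8 versus 2 as Latin-1, and nothing in the data says which reading was intended.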

Upvotes: 1
