Different utf8 encodings?

Question

I have a small issue with UTF-8 encoding. The word I am trying to encode is "kühl", so it contains a special character.

When I encode this string as UTF-8 in the first file, I get:

kÃ¼hl

When I encode this string as UTF-8 in the second file, I get:

kuÌ�hl

With PHP's utf8_encode() I always get the first one (kÃ¼hl) as output, but I need the second one (kuÌ�hl).

mb_detect_encoding reports "UTF-8" for both, so that does not really help.

Do you have any ideas for getting the second one as output? Thanks in advance!

tripleee · Accepted Answer

There is only one encoding called UTF-8, but Unicode has more than one way to represent some characters. U+00FC is the precomposed ü, a single code point inherited from Latin-1; its UTF-8 bytes (0xC3 0xBC) show up as Ã¼ when viewed as Latin-1, which gives kÃ¼hl. By contrast, kuÌ�hl looks like the fully decomposed form of the same character, i.e. U+0075 (u) followed by U+0308 (combining diaeresis). See also http://en.wikipedia.org/wiki/Unicode_equivalence#Normalization

vbvntv$ perl -CSD -le 'print "ku\x{0308}hl"' | iconv -f latin1 -t utf8
kuÌ�hl
vbvntv$ perl -CSD -le 'print "ku\x{0308}hl"' | xxd
0000000: 6b75 cc88 686c 0a                   ku..hl.
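
For readers without Perl at hand, the same comparison can be sketched in Python. This is only an illustration: the variable names are made up here, and the Latin-1 decoding step stands in for whatever viewer produced the text in the question.

```python
# Two Unicode spellings of the same word.
precomposed = "k\u00fchl"   # ü as the single code point U+00FC
decomposed = "ku\u0308hl"   # u followed by U+0308 COMBINING DIAERESIS

for label, text in (("precomposed", precomposed), ("decomposed", decomposed)):
    raw = text.encode("utf-8")
    # Misreading the UTF-8 bytes as Latin-1 reproduces the garbled text;
    # repr() is used so the unprintable 0x88 stays visible as an escape.
    print(label, raw.hex(), repr(raw.decode("latin-1")))
```

Both strings are valid UTF-8; they differ only in which code points were encoded.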

0x88 is not a printable character in Latin-1 (it falls in the C1 control range), so in my browser it displays as an "invalid character" placeholder (a black diamond with a white question mark in it), whereas others might see something else, or nothing at all.

Apparently you can use PHP's Normalizer class (part of the intl extension) to convert between these two forms; FORM_D produces the decomposed one:

$normalized = Normalizer::normalize($input, Normalizer::FORM_D);
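
To see what the normalization does, here is a rough Python equivalent. It assumes the standard library's unicodedata.normalize behaves like Normalizer::normalize for this input, with "NFD" corresponding to FORM_D and "NFC" to the composed form; it is a sketch, not a substitute for testing the PHP call.

```python
import unicodedata

composed = "k\u00fchl"  # precomposed ü (U+00FC)

# NFD splits ü into u + U+0308, which is the form asked for in the question.
decomposed = unicodedata.normalize("NFD", composed)
print([f"U+{ord(ch):04X}" for ch in decomposed])
print(decomposed.encode("utf-8").hex())

# NFC goes the other way and restores the single code point.
recomposed = unicodedata.normalize("NFC", decomposed)
print(recomposed == composed)
```

Note that normalization works on text that is already correctly decoded; it will not repair a string that was encoded from the wrong source charset.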

By the way, viewing UTF-8 as Latin-1 and copy/pasting the result as if it were real text is unreliable at best. If you have character encoding questions, the actual bytes (for example, in hex) are the only portable, understandable way to express what you have. How your computer renders them is unpredictable in many scenarios, especially when the encoding is problematic or unknown. I have stuck with the presentation you used in your question, but if you have additional questions, take care to state the problem unambiguously.
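
One way to produce such an unambiguous description is a small helper like the following Python sketch. The function name describe is hypothetical, and it assumes the bytes you pass in are valid UTF-8 (it raises UnicodeDecodeError otherwise).

```python
import unicodedata

def describe(raw: bytes):
    """Return the hex bytes plus one line per code point, assuming valid UTF-8."""
    lines = ["bytes: " + raw.hex()]
    for ch in raw.decode("utf-8"):
        # Fall back to a placeholder for code points without a Unicode name.
        lines.append(f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}")
    return lines

# Example: the decomposed spelling, given as raw bytes.
for line in describe(b"ku\xcc\x88hl"):
    print(line)
```

Pasting output like this into a question avoids any dependence on how a browser or terminal renders the text.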
