user2266317

Reputation: 3

Different utf8 encodings?

I have a small issue with UTF-8 encoding. The word I am trying to encode is "kühl", so it has a special character in it.

When I encode this string as UTF-8 in the first file, I get:

kühl

When I encode this string as UTF-8 in the second file, I get:

kuÌ�hl

With PHP's utf8_encode() I always get the first one (kühl) as output, but I need the second one (kuÌ�hl).

mb_detect_encoding reports "UTF-8" for both, so that does not really help.

Do you have any ideas how to get the second one as output? Thanks in advance!

Upvotes: 0

Views: 160

Answers (2)

tripleee

Reputation: 189387

There is only one encoding called UTF-8, but Unicode has more than one way to represent some characters. U+00FC is the single precomposed ü inherited from Latin-1; its UTF-8 encoding (0xC3 0xBC) displays as kühl when the bytes are viewed as Latin-1. Off the top of my head, kuÌ�hl looks like the fully decomposed form of the same character, i.e. U+0075 (u) followed by U+0308 (combining diaeresis), again with its UTF-8 bytes viewed as Latin-1. See also http://en.wikipedia.org/wiki/Unicode_equivalence#Normalization

vbvntv$ perl -CSD -le 'print "ku\x{0308}hl"' | iconv -f latin1 -t utf8
kuÌ�hl
vbvntv$ perl -CSD -le 'print "ku\x{0308}hl"' | xxd
0000000: 6b75 cc88 686c 0a                   ku..hl.

0x88 is not a printable character in Latin-1 (it falls in the C1 control range), so in my browser it displays as an "invalid character" placeholder (a black diamond with a white question mark in it), whereas others might see something else, or nothing at all.

Apparently you could use the Normalizer class (part of the intl extension) to convert between these two forms in PHP:

$normalized = Normalizer::normalize($input, Normalizer::FORM_D);
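To make that concrete, here is a minimal sketch of the round trip. It assumes PHP 7 or later with the intl extension loaded, and the variable names are only illustrative. It builds the precomposed string from an escape (so the source file's own encoding does not matter), decomposes it, and prints the bytes in hex.

```php
<?php
// Assumes the intl extension is available; it provides the Normalizer class.
$precomposed = "k\u{00FC}hl"; // U+00FC, the single precomposed code point

// NFD: split the ü into u (U+0075) + combining diaeresis (U+0308).
$decomposed = Normalizer::normalize($precomposed, Normalizer::FORM_D);

// NFC: recombine the pair into the single precomposed code point.
$recomposed = Normalizer::normalize($decomposed, Normalizer::FORM_C);

echo bin2hex($precomposed), "\n";       // 6bc3bc686c
echo bin2hex($decomposed), "\n";        // 6b75cc88686c
var_dump($recomposed === $precomposed); // bool(true)
```

The second hex string matches the xxd dump above (minus the trailing newline), so FORM_D should be the form the question is asking for.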

By the by, viewing UTF-8 as Latin-1 and copy/pasting the representation as if it were actual real text is capricious at best. If you have character encoding questions, the actual bytes (for example, in hex) are the only portable, understandable way to express what you have. How your computer renders them is unpredictable in many scenarios, especially when the encoding is problematic or unknown. I have stuck with the presentation you used in your question, but if you have additional questions, take care to articulate the problem unambiguously.
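For what it is worth, getting at the bytes from PHP is not hard. The two helpers below are hypothetical names of my own, not built-in functions. The sketch assumes the mbstring extension and PHP 7.2 or later (for mb_ord), and that the input is valid UTF-8.

```php
<?php
// Hypothetical helper (not part of PHP): show a string's bytes as spaced hex.
function hexBytes(string $s): string
{
    return implode(' ', str_split(bin2hex($s), 2));
}

// Hypothetical helper: list the code points of a valid UTF-8 string.
// Assumes mbstring; preg_split with the /u flag splits per code point.
function codePoints(string $s): string
{
    $chars = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
    $labels = array_map(function (string $c): string {
        return sprintf('U+%04X', mb_ord($c, 'UTF-8'));
    }, $chars);
    return implode(' ', $labels);
}

$sample = "ku\u{0308}hl"; // decomposed form, built from escapes
echo hexBytes($sample), "\n";   // 6b 75 cc 88 68 6c
echo codePoints($sample), "\n"; // U+006B U+0075 U+0308 U+0068 U+006C
```

Pasting output like this into a question removes any doubt about which form a file actually contains.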

Upvotes: 4

Evert

Reputation: 99533

utf8_encode, despite its name, does not magically encode into UTF-8.

It will only work if your source is ISO-8859-1, also known as Latin-1.

If your source was already UTF-8 or any other encoding, it will output broken data.
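A small sketch of the difference, assuming the mbstring extension. It uses mb_convert_encoding() from ISO-8859-1 as a stand-in for utf8_encode(), since both map each input byte to the code point of the same number (and utf8_encode() is deprecated as of PHP 8.2). The variable names are illustrative.

```php
<?php
// Assumes mbstring. Converting from ISO-8859-1 treats every input byte
// as one character, which is also what utf8_encode() does.
$latin1 = "k\xFChl";     // ISO-8859-1 bytes: 6b fc 68 6c
$utf8   = "k\u{00FC}hl"; // already UTF-8:    6b c3 bc 68 6c

$ok     = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
$broken = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo bin2hex($ok), "\n";     // 6bc3bc686c
echo bin2hex($broken), "\n"; // 6bc383c2bc686c (double-encoded)
```

The second result is the classic double encoding: the two bytes of ü are each re-encoded as if they were separate Latin-1 characters.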

Upvotes: 0
