Reputation: 3
I`ve a small issue with utf8 encoding. the word i try to encode is "kühl". So it has a special character in it.
When i encode this string with utf8 in the first file i get:
kühl
When i encode this string with utf8 in the second file i get:
ku�hl
With php utf8_encode() i always get the first one (kühl) as an output, but i would need the second one as output (ku�hl).
mb_detect_encoding tells me for both it is "UTF-8", so this does not really help.
do you have any ideas to get the second one as output? thanks in advance!
Upvotes: 0
Views: 160
Reputation: 189387
There is only one encoding called UTF-8 but there are multiple ways to represent some glyphs in Unicode. U+00FC is the Latin-1 compatibility single-glyph precomposed ü which displays as kühl in Latin-1 whereas off the top of my head kuÌ�hl looks like a fully decomposed expression of the same character, i.e. U+0075 (u) followed by U+0308 (combining diaeresis). See also http://en.wikipedia.org/wiki/Unicode_equivalence#Normalization
vbvntv$ perl -CSD -le 'print "ku\x{0308}hl"' | iconv -f latin1 -t utf8
ku�hl
vbvntv$ perl -CSD -le 'print "ku\x{0308}hl"' | xxd
0000000: 6b75 cc88 686c 0a ku..hl.
0x88 is not a valid character in Latin-1 so (in my browser) it displays as an "invalid character" placeholder (black diamond with a white question mark in it) whereas others might see something else, or nothing at all.
Apparently you could use class.normalize
to convert between these two forms in PHP:
$normalized = Normalizer::normalize($input, Normalizer::FORM_D);
By the by, viewing UTF8 as Latin-1 and copy/pasting the representation as if it was actual real text is capricious at best. If you have character encoding questions, the actual bytes (for example, in hex) is the only portable, understandable way to express what you have. How your computer renders it is unpredictable in many scenarios, especially when the encoding is problematic or unknown. I have stuck with the presentation you used in your question, but if you have additional questions, take care to articulate the problem unambiguously.
Upvotes: 4
Reputation: 99533
utf8_encode, despite it's name, does not magically encode into UTF-8.
It will only work, if your source is ISO-8559-1, also known as latin-1.
If your source was already UTF-8 or any other encoding it will output broken data.
Upvotes: 0