Reputation: 3480
I would like to serve UTF-8 websites directly from Perl. I ran into several encoding issues because the source data is not stored entirely in UTF-8. While debugging them, I discovered two different representations of the German umlaut ü. Which one is the correct stored value in Perl?
\xFC, which is the Unicode code point U+00FC for ü
0xC3 0xBC, which is the UTF-8 byte sequence for ü
If there isn't any difference, why does Perl store umlauts in different representations instead of consistently using either the Unicode code point or the UTF-8 byte sequence?
Unicode/UTF-8 character table reference
Upvotes: 3
Views: 1695
Reputation: 8532
Both of these are correct. It depends what your intentions are.
\xFC is the correct form of a string of Unicode text that contains the ü character. This is typically the form in which you process the string of text within your application.
0xC3 0xBC is the correct form of a string of bytes which encodes the ü character into UTF-8. This is typically the form in which you receive or transmit UTF-8 bytes from or to some external entity, such as a network socket or disk filehandle.
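To see both forms side by side, here is a minimal sketch using the core Encode module. The variable names are only illustrative and are not taken from the question:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# A character string: one character, code point U+00FC.
my $chars = "\xFC";

# Encoding turns it into a byte string: two bytes, 0xC3 0xBC.
my $bytes = encode('UTF-8', $chars);

# prints "characters: 1, code point: U+00FC"
printf "characters: %d, code point: U+%04X\n", length($chars), ord($chars);

# prints "bytes: 2, hex: C3 BC"
printf "bytes: %d, hex: %s\n", length($bytes),
    join ' ', map { sprintf '%02X', ord } split //, $bytes;

# Decoding the bytes gives back the original character string.
my $round_trip = decode('UTF-8', $bytes);
print "round trip ok\n" if $round_trip eq $chars;
```

The same ü has length 1 as text and length 2 as UTF-8 bytes, which is why it matters to know which of the two a given variable holds.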
Upvotes: 2
Reputation: 385897
Use Encoding::FixLatin's fix_latin.
$ perl -MEncoding::FixLatin=fix_latin -MEncode=encode_utf8 \
-E'say sprintf "%v02X", encode_utf8(fix_latin("\xFC\xC3\xBC"))'
C3.BC.C3.BC
Internally, it's best to work with Unicode. Decode inputs, encode outputs. You likely ended up with the mix by forgetting to encode an output.
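A rough sketch of the "decode inputs, encode outputs" pattern with the core Encode module follows. The sample data and the assumption that the input arrives as ISO-8859-1 are hypothetical; the real source encoding has to be known or detected:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical input: raw bytes from a source assumed to be ISO-8859-1.
my $input_bytes = "M\xFCller";

# Decode at the input boundary: bytes in, characters out.
my $text = decode('ISO-8859-1', $input_bytes);

# Inside the program, work with characters only.
my $char_count = length $text;    # 6 characters

# Encode at the output boundary: characters in, UTF-8 bytes out.
my $output_bytes = encode('UTF-8', $text);

# The string is already encoded, so write it through a raw handle.
binmode STDOUT;
print $output_bytes, "\n";
```

For filehandles, an I/O layer such as binmode(STDOUT, ':encoding(UTF-8)') performs the encoding step on every print, in which case the explicit encode call should be dropped so the output is not encoded twice.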
Upvotes: 8
Reputation: 189457
There is no "correct" one; they are different representations. Generally speaking, it is probably better to settle on Unicode internally and print it out as UTF-8, but the main complication is knowing exactly what you have at each step of processing. If you can use UTF-8 reliably throughout, that may be simpler in your case.
Upvotes: 3