
Reputation: 3480

What is the correct stored value for umlaut "ü" in Perl?

I would like to deliver UTF-8 websites with Perl directly. I ran into several encoding issues because the source data is not completely stored in UTF-8. While debugging these issues, I discovered two different representations of the German umlaut ü. Which one is the correct stored value in Perl?

\xFC, which is the Unicode position U+00FC for ü
0xC3 0xBC, which is the UTF-8 hex representation for ü

If there isn't any difference, then why does Perl store umlauts in different representations instead of consistently using either the Unicode position or the UTF-8 hex representation?

Unicode/UTF-8 character table reference

Upvotes: 3

Answers (3)

LeoNerd

Reputation: 8532

Both of these are correct. It depends what your intentions are.

\xFC is the correct form of a string of Unicode text that contains the ü character. This is typically the form in which you process the string of text within your application.

0xC3 0xBC is the correct form of a string of bytes which encodes the ü character into UTF-8. This is typically the form in which you receive or transmit UTF-8 bytes from or to some external entity, such as a network socket or disk filehandle.
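To make the distinction concrete, here is a minimal sketch using the core Encode module. The variable names are only illustrative. It shows that the text string holds one character while the encoded string holds two bytes, and that decoding recovers the original character.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Text: a single character, U+00FC, written with a string escape.
my $text = "\xFC";

# Encoding the text as UTF-8 yields the two bytes 0xC3 0xBC.
my $bytes = encode('UTF-8', $text);

# Decoding those bytes gives back the single character.
my $round_trip = decode('UTF-8', $bytes);

printf "text:  length %d, values %vX\n", length($text),  $text;
printf "bytes: length %d, values %vX\n", length($bytes), $bytes;
print  "round trip ok\n" if $round_trip eq $text;
```

This should report a length of 1 with value FC for the text, and a length of 2 with values C3.BC for the bytes.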

Upvotes: 2

ikegami

Reputation: 385897

Use Encoding::FixLatin's fix_latin.

$ perl -MEncoding::FixLatin=fix_latin -MEncode=encode_utf8 \
   -E'say sprintf "%v02X", encode_utf8(fix_latin("\xFC\xC3\xBC"))'
C3.BC.C3.BC

Internally, it's best to work with Unicode. Decode inputs, encode outputs. You likely ended up with the mix by forgetting to encode an output.
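A rough sketch of the "decode inputs, encode outputs" pattern, assuming the input really is UTF-8. The sample string and variable names are made up for illustration. If the final encode step is skipped and the string contains only characters below U+0100, Perl writes ü as the single byte \xFC, which is one way to end up with mixed output.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical input: bytes as they might arrive from a UTF-8 file or socket.
my $input_bytes = "M\xC3\xBCnchen";    # "München" encoded as UTF-8, 8 bytes

# Decode at the input boundary: from here on the string holds characters.
my $text = decode('UTF-8', $input_bytes);

# Process as text: length and uc now treat "ü" as one character.
my $upper = uc $text;                  # "MÜNCHEN", 7 characters

# Encode at the output boundary, just before writing the bytes out.
my $output_bytes = encode('UTF-8', $upper);

binmode STDOUT;                        # raw bytes; we encoded by hand
print $output_bytes, "\n";
```

For filehandles, an I/O layer such as binmode(STDOUT, ':encoding(UTF-8)') performs the encode step on every print, so it cannot be forgotten.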

Upvotes: 8

tripleee

Reputation: 189457

There is no "correct" one; they are different representations. Generally speaking, it would probably be better to settle on Unicode internally and print it out as UTF-8, but the main complication is really knowing exactly what you have at each step of processing. If you can use UTF-8 reliably throughout, maybe that's simpler in your case.
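One way to find out what you have at a given step is to test whether a byte string is well-formed UTF-8. The helper below is a hypothetical sketch, not part of any module. It relies on Encode's decode with the FB_CROAK check, which dies on malformed input. Pure ASCII passes as well, so a true result does not prove that the data was intended as UTF-8.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: returns 1 if $bytes is a well-formed UTF-8 byte
# sequence, 0 otherwise. Expects a byte string, not already-decoded text.
sub is_valid_utf8 {
    my ($bytes) = @_;    # a copy, since decode with a CHECK value may modify its argument
    my $ok = eval { Encode::decode('UTF-8', $bytes, Encode::FB_CROAK()); 1 };
    return $ok ? 1 : 0;
}

for my $sample ("\xC3\xBC", "\xFC") {
    printf "%vX: %s\n", $sample,
        is_valid_utf8($sample) ? 'valid UTF-8' : 'not valid UTF-8';
}
```

The two-byte sequence is accepted, while the lone byte \xFC is rejected because it is not a complete UTF-8 sequence.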

Upvotes: 3
