burnersk

Reputation: 3480

What is the correct stored value for umlaut "ü" in Perl?

I would like to serve UTF-8 web pages directly from Perl. I ran into several encoding issues because the source data is not stored consistently in UTF-8. While debugging these issues I discovered two different representations of the German umlaut ü. Which one is the correct stored value in Perl?

If there is no difference, why does Perl store umlauts in different representations rather than consistently using either the Unicode code point or the UTF-8 byte sequence?

Unicode/UTF-8 character table reference

Upvotes: 3

Views: 1695

Answers (3)

LeoNerd

Reputation: 8532

Both of these are correct. It depends on what your intentions are.

\xFC is the correct form for a string of Unicode text that contains the ü character (code point U+00FC). This is typically the form in which you process the text within your application.

0xC3 0xBC is the correct form of a string of bytes which encodes the ü character into UTF-8. This is typically the form in which you receive or transmit UTF-8 bytes from or to some external entity, such as a network socket or disk filehandle.
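To make the distinction between the two forms concrete, here is a minimal sketch using the core Encode module. The variable names are illustrative only; it assumes the incoming data really is valid UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Bytes as they might arrive from a socket or file: the UTF-8 encoding of "ü".
my $bytes = "\xC3\xBC";

# Decode the bytes into a string of Unicode characters for internal processing.
my $text = decode('UTF-8', $bytes);

# %vX prints the ordinal of each element of the string, joined by dots.
printf "bytes: length %d, %vX\n", length($bytes), $bytes;   # length 2, C3.BC
printf "text:  length %d, %vX\n", length($text),  $text;    # length 1, FC

# Encode the text again before handing it to the outside world.
my $out = encode('UTF-8', $text);
```

The same character is one element (0xFC) as text and two elements (0xC3 0xBC) as UTF-8 bytes, which is why both values show up during debugging.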

Upvotes: 2

ikegami

Reputation: 385897

Use Encoding::FixLatin's fix_latin.

$ perl -MEncoding::FixLatin=fix_latin -MEncode=encode_utf8 \
   -E'say sprintf "%v02X", encode_utf8(fix_latin("\xFC\xC3\xBC"))'
C3.BC.C3.BC

Internally, it's best to work with Unicode: decode inputs, encode outputs. You most likely ended up with the mix by forgetting to encode an output somewhere.
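One way to follow the "decode inputs, encode outputs" rule is to let PerlIO layers do the conversion at the filehandle boundary. The sketch below is only an illustration: the sample data and variable names are made up, and in-memory buffers stand in for real files or sockets.

```perl
use strict;
use warnings;

# Stand-in for an external UTF-8 source (hypothetical data; a real program
# would open a file or socket instead of an in-memory buffer).
my $input_bytes = "gr\xC3\xBC\xC3\x9F\n";   # "grüß" encoded as UTF-8

# Decode on input: the layer turns UTF-8 bytes into characters as we read.
open my $in, '<:encoding(UTF-8)', \$input_bytes or die "open for reading: $!";
my $line = <$in>;
close $in;
chomp $line;

# Work with characters internally.
my $char_count = length $line;   # counts characters, not bytes

# Encode on output: the layer turns characters back into UTF-8 bytes.
my $output_bytes = '';
open my $out, '>:encoding(UTF-8)', \$output_bytes or die "open for writing: $!";
print {$out} $line;
close $out;
```

With the layers in place, the code between the read and the write never sees raw UTF-8 bytes, so there is no step at which an encode can be forgotten.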

Upvotes: 8

tripleee

Reputation: 189457

There is no "correct" one; they are different representations. Generally speaking, it would probably be better to settle on Unicode internally and print it out as UTF-8, but the main complication is really knowing exactly what you have at each step of processing. If you can use UTF-8 reliably throughout, maybe that's simpler in your case.
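To check what you have at a given step, it can help to dump the ordinals of a string while debugging. The two helpers below, dump_string and looks_like_utf8_bytes, are hypothetical names invented for this sketch, and the second one is only a heuristic: a string that passes could still be decoded text that happens to contain those characters.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: show each element of a string as a hex ordinal.
sub dump_string {
    my ($s) = @_;
    return sprintf '%vX', $s;
}

# Hypothetical helper: true if the string contains only byte values
# and those bytes form a valid UTF-8 sequence. Heuristic only.
sub looks_like_utf8_bytes {
    my ($s) = @_;
    return 0 if $s =~ /[^\x00-\xFF]/;   # wide characters cannot be bytes
    return eval { decode('UTF-8', $s, FB_CROAK | LEAVE_SRC); 1 } ? 1 : 0;
}

print dump_string("\xFC"), "\n";       # FC    (one character, U+00FC)
print dump_string("\xC3\xBC"), "\n";   # C3.BC (two bytes of UTF-8)
```

A lone \xFC fails the check because it is not a valid UTF-8 sequence on its own, which is a strong hint that the string is either decoded text or Latin-1 bytes.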

Upvotes: 3
