Perl multi-byte character encoding for HTML

Question

I'm being passed a string such as:

my $x = "Zakłady Kuźnicze";

If you examine it more closely, you see that each of those two unusual letters is actually composed of two bytes:

foreach (split(//, $x)) { print $_.' '.ord($_)."\n"; }

Z 90
a 97
k 107
� 197
� 130
a 97
d 100
y 121
  32
K 75
u 117
� 197
� 186
n 110
i 105
c 99
z 122
e 101

I want to convert this to encoded HTML using the codes described here: https://www.w3schools.com/charsets/ref_utf_latin_extended_a.asp

So I need a function such that:

print encode_it($x)."\n";

yields:

Zak&#322;ady Ku&#378;nicze

I've tried HTML::Entities::encode and HTML::Entities::encode_numeric, but these yield:

Zak&Aring;&#x82;ady Ku&Aring;&ordm;nicze

Zak&#xC5;&#x82;ady Ku&#xC5;&#xBA;nicze

That does not help; it renders as:

ZakÅ‚ady KuÅºnicze

Can anyone advise how to achieve this?

EDIT:

As ikegami showed, it works if use utf8 is in effect AND the string is a literal in the program:

perl -e 'use utf8; chomp; printf "%X\n", ord for split //, "Zakłady Kuźnicze"'
5A
61
6B
142
61
64
79
20
4B
75
17A
6E
69
63
7A
65

...but my input is actually coming in via STDIN, and it's not working from STDIN:

echo "Zakłady Kuźnicze" | perl -ne 'use utf8; chomp; printf "%X\n", ord for split //'
5A
61
6B
C5
82
61
64
79
20
4B
75
C5
BA
6E
69
63
7A
65

What subtlety am I missing here?

ikegami · Accepted Answer

Perl expects the source to be either ASCII[1] (no utf8;, the default) or UTF-8 (use utf8;). You appear to have a file encoded using UTF-8, but you didn't tell Perl that, so it sees

my $x = "Zak\xC5\x82ady Ku\xC5\xBAnicze";

rather than the intended

my $x = "Zak\x{142}ady Ku\x{17A}nicze";
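Once the bytes have been decoded into characters, the numeric references the question asks for follow directly from each character's code point. The following is a minimal sketch, not a definitive implementation: encode_it is the hypothetical name taken from the question, the input is assumed to be UTF-8 encoded bytes, and only non-ASCII characters are replaced (markup characters such as & and < are left untouched). It uses the core Encode module rather than HTML::Entities.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (name from the question). Assumes $bytes holds
# UTF-8 encoded bytes, as in a string literal written without "use utf8".
sub encode_it {
    my ($bytes) = @_;

    # Turn the UTF-8 byte sequence into a string of characters.
    my $chars = decode('UTF-8', $bytes);

    # Replace every non-ASCII character with a decimal numeric reference.
    $chars =~ s/([^\x00-\x7F])/sprintf '&#%d;', ord($1)/ge;

    return $chars;
}

print encode_it("Zak\xC5\x82ady Ku\xC5\xBAnicze"), "\n";
# prints: Zak&#322;ady Ku&#378;nicze
```

If hexadecimal references are acceptable, calling HTML::Entities::encode_entities_numeric on the decoded string should give an equivalent result (for example &#x142; for ł).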

Example (UTF-8 terminal):

$ diff -U 0 \
   <( perl -e'no utf8;  printf "%X\n", ord for split //, "Zakłady Kuźnicze"' ) \
   <( perl -e'use utf8; printf "%X\n", ord for split //, "Zakłady Kuźnicze"' )
--- /dev/fd/63  2020-01-17 20:04:23.407591294 -0800
+++ /dev/fd/62  2020-01-17 20:04:23.407591294 -0800
@@ -4,2 +4 @@
-C5
-82
+142
@@ -12,2 +11 @@
-C5
-BA
+17A

Add use utf8;.
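Note that use utf8; only affects how the source code itself is read; it does not decode data arriving on STDIN, which is why the piped example in the question still shows C5 82 and C5 BA. A minimal sketch of decoding the input follows, assuming the input is UTF-8. The code_points helper is a hypothetical name, and an in-memory handle stands in for STDIN so the example is self-contained.

```perl
use strict;
use warnings;

# Hypothetical helper: read one line from a handle that already has a
# UTF-8 decoding layer and return its code points as hex strings.
sub code_points {
    my ($fh) = @_;
    my $line = <$fh>;
    chomp $line;
    return map { sprintf '%X', ord($_) } split //, $line;
}

# For real input, add the decoding layer to STDIN before reading:
#   binmode(STDIN, ':encoding(UTF-8)');
#   print "$_\n" for code_points(\*STDIN);

# Demonstration: an in-memory handle holding the same bytes the pipe delivers.
my $bytes = "Zak\xC5\x82ady Ku\xC5\xBAnicze\n";
open my $fh, '<:encoding(UTF-8)', \$bytes or die "open: $!";
print join(' ', code_points($fh)), "\n";
# prints: 5A 61 6B 142 61 64 79 20 4B 75 17A 6E 69 63 7A 65
```

For a one-liner, the -CS switch marks STDIN, STDOUT and STDERR as UTF-8, so something like echo "Zakłady Kuźnicze" | perl -CS -ne 'chomp; printf "%X\n", ord for split //' should report 142 and 17A as well.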

[1] An 8-bit clean version of ASCII, meaning that any byte with the 8th bit set in a string or regex literal results in a character with the same value.
