Duke Leto

Reputation: 215

Perl multi-byte character encoding for HTML

I'm being passed a string such as:

my $x = "Zakłady Kuźnicze";

If you examine it more closely, you see that each of the two non-ASCII letters is actually made up of two bytes:

foreach (split(//, $x)) { print $_.' '.ord($_)."\n"; }

Z 90
a 97
k 107
� 197
� 130
a 97
d 100
y 121
  32
K 75
u 117
� 197
� 186
n 110
i 105
c 99
z 122
e 101

I want to convert this to encoded HTML using the codes described here: https://www.w3schools.com/charsets/ref_utf_latin_extended_a.asp

So I need a function such that:

print encode_it($x)."\n";

yields:

Zak&#322;ady Ku&#378;nicze

I've tried HTML::Entities::encode and HTML::Entities::encode_numeric, but these yield:

Zak&Aring;&#130;ady Ku&Aring;&ordm;nicze

Zak&#xC5;&#x82;ady Ku&#xC5;&#xBA;nicze

Which does not help, it renders as:

ZakÅ‚ady KuÅºnicze

Can anyone advise how to achieve this?

EDIT:

As ikegami showed, it works if use utf8; is in effect AND the string is a literal in the program:

perl -e 'use utf8; chomp; printf "%X\n", ord for split //, "Zakłady Kuźnicze"'
5A
61
6B
142
61
64
79
20
4B
75
17A
6E
69
63
7A
65

...but my input is actually coming in via STDIN, and it's not working from STDIN:

echo "Zakłady Kuźnicze" | perl -ne 'use utf8; chomp; printf "%X\n", ord for split //'
5A
61
6B
C5
82
61
64
79
20
4B
75
C5
BA
6E
69
63
7A
65

What subtlety am I missing here?

Upvotes: 2

Views: 458

Answers (2)

Grinnz

Reputation: 9231

As @ikegami said, use utf8; will decode your source code from UTF-8 so that string literals and other symbols can be interpreted as intended. Like the source code, input to your code is also in bytes, and usually UTF-8 encoded if it's text. So, depending on where it is coming from, you have several options to decode it into useful characters. The options listed below are alternatives; you only need one for any particular stream of input.

From STDIN:

use open ':std', IN => ':encoding(UTF-8)'; # also affects read filehandles opened in this scope
use open ':std', ':encoding(UTF-8)'; # also affects STDOUT, STDERR, and all filehandles opened in this scope
binmode *STDIN, ':encoding(UTF-8)'; # STDIN only

Or these switches for oneliners:

-CI # STDIN only
-CS # STDIN, STDOUT, STDERR
-Mopen=':std,IN,:encoding(UTF-8)' # equivalent to first "use open" above

From handles you open yourself:

use open IN => ':encoding(UTF-8)'; # all read handles opened in this scope
use open ':encoding(UTF-8)'; # also affects write handles
open my $fh, '<:encoding(UTF-8)', 'example.txt' or die "Failed to open example.txt: $!";
binmode $fh, ':encoding(UTF-8)'; # to set on already opened handle

Or these switches for oneliners:

-Ci # read handles only
-CD # all handles opened
-Mopen='IN,:encoding(UTF-8)' # equivalent to first "use open" above

The above use open and -C options also apply to ARGV (the handle used by -n, -p, or the <>/readline operator to read the files named as arguments - this is different from when it is used to read STDIN). -C switches can be combined; for example, -CSD will set it for STDIN/OUT/ERR as well as all handles opened.
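As a small illustration of the per-handle option, here is a sketch that applies the :encoding(UTF-8) layer when opening a handle. To keep it self-contained it reads from an in-memory string instead of example.txt; the variable names and the in-memory handle are assumptions made for the demonstration, not part of the original question.

```perl
use strict;
use warnings;

# The UTF-8 bytes from the question, as they would arrive from a file or pipe.
my $bytes = "Zak\xC5\x82ady Ku\xC5\xBAnicze\n";

# Open an in-memory handle with a decoding layer (stand-in for a real file).
open my $fh, '<:encoding(UTF-8)', \$bytes
    or die "Failed to open in-memory handle: $!";
my $line = <$fh>;
close $fh;
chomp $line;

# Each element is now one character, so the two Polish letters
# are reported as single code points rather than as byte pairs.
printf "%X\n", ord for split //, $line;
```

With the layer in place, split // yields characters rather than bytes, which is the behaviour the question was missing when reading from STDIN.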

Finally, you can decode the data itself after reading rather than affecting any handles globally (the examples below assume the data is in $_):

utf8::decode($_) or die "Invalid UTF-8"; # in place, does not require "use utf8"
$_ = Encode::decode('UTF-8', $_); # with Encode loaded
$_ = Encode::Simple::decode_utf8($_); # with Encode::Simple loaded

Just remember that if you want to output such decoded characters, or characters from literals with use utf8; in effect for your source code, then STDOUT, STDERR, and other write handles need the same treatment, or you need to encode the data to UTF-8 before printing.
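Putting the pieces together for the original question, the following is one possible sketch of the encode_it function that was asked for. It assumes the argument holds undecoded UTF-8 bytes, decodes them with utf8::decode, and then replaces every non-ASCII character with a decimal character reference using a plain substitution. The regex approach is an assumption chosen to avoid extra modules; HTML::Entities on a decoded string is another route.

```perl
use strict;
use warnings;

# Hypothetical helper named after the encode_it() in the question.
# Assumes $bytes contains UTF-8 encoded bytes (e.g. read from an undecoded STDIN).
sub encode_it {
    my ($bytes) = @_;
    my $text = $bytes;                           # work on a copy
    utf8::decode($text) or die "Invalid UTF-8";  # bytes -> characters, in place
    # Replace each non-ASCII character with a decimal character reference.
    $text =~ s/([^\x00-\x7F])/sprintf('&#%d;', ord $1)/ge;
    return $text;
}

my $x = "Zak\xC5\x82ady Ku\xC5\xBAnicze";        # the bytes shown in the question
print encode_it($x), "\n";                       # prints Zak&#322;ady Ku&#378;nicze
```

Because the result is pure ASCII, it can be printed without setting an encoding layer on STDOUT. Note that this sketch leaves &, < and > untouched; if the input may contain them, they need escaping as well.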

Upvotes: 2

ikegami

Reputation: 385764

Perl expects the source to be either ASCII[1] (no utf8;, the default) or UTF-8 (use utf8;). You appear to have a file encoded using UTF-8, but you didn't tell Perl that, so it sees

my $x = "Zak\xC5\x82ady Ku\xC5\xBAnicze";

rather than the intended

my $x = "Zak\x{142}ady Ku\x{17A}nicze";
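The two literals above differ only by a decoding step. The sketch below, written with escapes so that it does not depend on the encoding of the source file, checks this; the variable names are chosen for illustration.

```perl
use strict;
use warnings;

my $as_bytes = "Zak\xC5\x82ady Ku\xC5\xBAnicze";   # what Perl sees without "use utf8;"
my $as_chars = "Zak\x{142}ady Ku\x{17A}nicze";     # what was intended

printf "bytes: length %d\n", length $as_bytes;     # 18
printf "chars: length %d\n", length $as_chars;     # 16

# Decoding the byte string as UTF-8 produces the intended character string.
my $decoded = $as_bytes;
utf8::decode($decoded) or die "Invalid UTF-8";
print "decoded bytes equal the intended string\n" if $decoded eq $as_chars;
```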

Example (UTF-8 terminal):

$ diff -U 0 \
   <( perl -e'no utf8;  printf "%X\n", ord for split //, "Zakłady Kuźnicze"' ) \
   <( perl -e'use utf8; printf "%X\n", ord for split //, "Zakłady Kuźnicze"' )
--- /dev/fd/63  2020-01-17 20:04:23.407591294 -0800
+++ /dev/fd/62  2020-01-17 20:04:23.407591294 -0800
@@ -4,2 +4 @@
-C5
-82
+142
@@ -12,2 +11 @@
-C5
-BA
+17A

Add use utf8;.


  1. An 8-bit clean version of ASCII, meaning that any byte with the 8th bit set in a string or regex literal results in a character with the same value.

Upvotes: 4
