I'm being passed a string such as:
my $x = "Zakłady Kuźnicze";
If you examine it more closely, you can see that each of those two unusual letters is actually made up of two bytes:
foreach (split(//, $x)) { print $_.' '.ord($_)."\n"; }
Z 90
a 97
k 107
� 197
� 130
a 97
d 100
y 121
32
K 75
u 117
� 197
� 186
n 110
i 105
c 99
z 122
e 101
I want to convert this to encoded HTML using the codes described here: https://www.w3schools.com/charsets/ref_utf_latin_extended_a.asp
So I need a function such that:
print encode_it($x)."\n";
yields:
Zak&#322;ady Ku&#378;nicze
I've tried HTML::Entities::encode and HTML::Entities::encode_numeric, but these yield:
Zak&Aring;&#130;ady Ku&Aring;&ordm;nicze
Zak&#xC5;&#x82;ady Ku&#xC5;&#xBA;nicze
That does not help; it renders as:
Zakłady Kuźnicze
Can anyone advise how to achieve this?
EDIT:
As ikegami showed, it works if use utf8 is used AND the string is set in the program:
perl -e 'use utf8; chomp; printf "%X\n", ord for split //, "Zakłady Kuźnicze"'
5A
61
6B
142
61
64
79
20
4B
75
17A
6E
69
63
7A
65
...but my input is actually coming in via STDIN, and it's not working from STDIN:
echo "Zakłady Kuźnicze" | perl -ne 'use utf8; chomp; printf "%X\n", ord for split //'
5A
61
6B
C5
82
61
64
79
20
4B
75
C5
BA
6E
69
63
7A
65
What subtlety am I missing here?
Upvotes: 2
As @ikegami said, use utf8; decodes your source code from UTF-8 so that string literals and other symbols are interpreted as intended. Like the source code, input to your program arrives as bytes, usually UTF-8 encoded if it is text. Depending on where the input comes from, you have several ways to decode it into characters. The options below are alternatives; you only need one for any particular input stream.
From STDIN:
use open ':std', IN => ':encoding(UTF-8)'; # also affects read filehandles opened in this scope
use open ':std', ':encoding(UTF-8)'; # also affects STDOUT, STDERR, and all filehandles opened in this scope
binmode *STDIN, ':encoding(UTF-8)'; # STDIN only
Or these switches for oneliners:
-CI # STDIN only
-CS # STDIN, STDOUT, STDERR
-Mopen=':std,IN,:encoding(UTF-8)' # equivalent to first "use open" above
From handles you open yourself:
use open IN => ':encoding(UTF-8)'; # all read handles opened in this scope
use open ':encoding(UTF-8)'; # also affects write handles
open my $fh, '<:encoding(UTF-8)', 'example.txt' or die "Failed to open example.txt: $!";
binmode $fh, ':encoding(UTF-8)'; # to set on already opened handle
Or these switches for oneliners:
-Ci # read handles only
-CD # all handles opened
-Mopen='IN,:encoding(UTF-8)' # equivalent to first "use open" above
The above use open and -C options also apply to ARGV (the handle used by -n, -p, or the <>/readline operator to read files named as arguments - this is different from when it is used to read STDIN). -C switches can be combined; for example, -CSD sets it for STDIN/STDOUT/STDERR as well as all handles opened.
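To see what an :encoding(UTF-8) layer actually changes, here is a minimal sketch that reads the same bytes twice through in-memory filehandles, once raw and once with the layer. The in-memory handle and the variable names are only stand-ins for a real file or stream; they are not part of the original question.

```perl
use strict;
use warnings;

# UTF-8 encoded bytes, standing in for the contents of a file.
my $bytes = "Zak\xC5\x82ady\n";

# No layer: readline hands back the bytes unchanged.
open my $raw, '<', \$bytes or die "Cannot open in-memory handle: $!";
my $raw_line = <$raw>;
close $raw;

# With the layer: readline hands back decoded characters.
open my $dec, '<:encoding(UTF-8)', \$bytes or die "Cannot open in-memory handle: $!";
my $dec_line = <$dec>;
close $dec;

chomp($raw_line, $dec_line);
printf "raw length: %d, decoded length: %d\n", length $raw_line, length $dec_line;
printf "third element raw: %X, decoded: %X\n",
    ord substr($raw_line, 3, 1), ord substr($dec_line, 3, 1);
```

The raw line is one element longer because the two bytes C5 82 become the single character U+0142 once decoded.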
Finally, you can decode the data itself after reading rather than affecting any handles globally (the examples below assume the data is in $_):
utf8::decode($_) or die "Invalid UTF-8"; # in place, does not require "use utf8"
$_ = Encode::decode('UTF-8', $_); # with Encode loaded
$_ = Encode::Simple::decode_utf8($_); # with Encode::Simple loaded
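As a rough illustration of the first two approaches, the sketch below decodes a hard-coded byte string (an assumption standing in for data read from an undecoded handle) and compares the results. Passing Encode::FB_CROAK is a choice made for this example so that malformed input dies instead of being silently replaced.

```perl
use strict;
use warnings;
use Encode ();

# UTF-8 encoded bytes, as they would arrive from an undecoded handle.
my $bytes = "Ku\xC5\xBAnicze";

# Option 1: utf8::decode works in place and returns false on malformed input.
my $in_place = $bytes;
utf8::decode($in_place) or die "Invalid UTF-8";

# Option 2: Encode::decode returns a new string. With a CHECK argument it may
# modify its input, so it is given a scratch copy here.
my $scratch = $bytes;
my $decoded = Encode::decode('UTF-8', $scratch, Encode::FB_CROAK);

printf "bytes: %d, characters: %d\n", length $bytes, length $decoded;
printf "same result: %s\n", $in_place eq $decoded ? 'yes' : 'no';
printf "third character: U+%04X\n", ord substr($decoded, 2, 1);
```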
Just remember that if you want to output such decoded characters, or characters from literals with use utf8; set for your source code, then STDOUT, STDERR, and other write handles need the same treatment, or you need to encode the data to UTF-8 before printing.
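Putting the pieces together for the original question, here is a minimal sketch of an encode_it function (the name comes from the question; the implementation is only an assumption about what is wanted). It uses a plain substitution rather than HTML::Entities, and it escapes only non-ASCII characters, so characters such as & and < are left untouched. The hard-coded byte string stands in for a line read from STDIN without any decoding layer.

```perl
use strict;
use warnings;

# Hypothetical helper: replace every non-ASCII character in an already
# decoded string with a decimal numeric character reference.
sub encode_it {
    my ($text) = @_;
    $text =~ s/([^\x00-\x7F])/sprintf('&#%d;', ord $1)/ge;
    return $text;
}

# Stand-in for undecoded input: raw UTF-8 bytes.
my $bytes = "Zak\xC5\x82ady Ku\xC5\xBAnicze";

# Decode a copy in place before encoding to entities.
my $chars = $bytes;
utf8::decode($chars) or die "Invalid UTF-8";

print encode_it($chars), "\n";   # Zak&#322;ady Ku&#378;nicze
print encode_it($bytes), "\n";   # skipping the decode step gives one entity per byte
```

Because the output is pure ASCII, it can be printed without setting an encoding layer on STDOUT.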
Upvotes: 2
Perl expects the source to be either ASCII[1] (no utf8;, the default) or UTF-8 (use utf8;). You appear to have a file encoded using UTF-8, but you didn't tell Perl that, so it sees
my $x = "Zak\xC5\x82ady Ku\xC5\xBAnicze";
rather than the intended
my $x = "Zak\x{142}ady Ku\x{17A}nicze";
Example (UTF-8 terminal):
$ diff -U 0 \
<( perl -e'no utf8; printf "%X\n", ord for split //, "Zakłady Kuźnicze"' ) \
<( perl -e'use utf8; printf "%X\n", ord for split //, "Zakłady Kuźnicze"' )
--- /dev/fd/63 2020-01-17 20:04:23.407591294 -0800
+++ /dev/fd/62 2020-01-17 20:04:23.407591294 -0800
@@ -4,2 +4 @@
-C5
-82
+142
@@ -12,2 +11 @@
-C5
-BA
+17A
Add use utf8; to your program.
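Since the pragma is lexically scoped, the difference can also be seen inside a single script. This is only a sketch, and it assumes the script file itself is saved as UTF-8; the variable names are made up for the illustration.

```perl
use strict;
use warnings;

my ($without, $with);

{
    no utf8;                 # the default: the literal is read as raw bytes
    $without = length "ł";
}
{
    use utf8;                # the literal is decoded from UTF-8
    $with = length "ł";
}

printf "without utf8: %d, with utf8: %d\n", $without, $with;
```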
Upvotes: 4