David Tonhofer

Reputation: 15316

Perl: Packing a sequence of bytes into a string

I'm trying to run a simple test in which I build differently formatted binary strings and print them out. What I'm actually investigating is a problem in which sprintf cannot deal with a wide-character string passed in for the placeholder %s.

In this case, the binary string should contain just the Cyrillic "д" (because its code point lies above the ISO-8859-1 range).

The code below works when I use the character directly in the source.

But nothing that passes through pack works.

The code:

#!/usr/bin/perl

use utf8; # Meaning "This lexical scope (i.e. file) contains utf8"

# https://perldoc.perl.org/open.html

use open qw(:std :encoding(UTF-8));

sub showme {
   my ($name,$ch) = @_;
   print "-------\n";
   print "This is test: $name\n";

   my $ord = ord($ch); # ordinal computed outside of "use bytes"; actually should yield the unicode codepoint

   {
      # https://perldoc.perl.org/bytes.html
      use bytes;
      my $mark = (utf8::is_utf8($ch) ? "yes" : "no");
      my $txt  = sprintf("Received string of length: %i byte, contents: %vd, ordinal x%04X, utf-8: %s\n", length($ch), $ch, $ord, $mark);
      print $txt,"\n";
   }

   print $ch, "\n";
   print "Combine: $ch\n";
   print "Concat: " . $ch . "\n";
   print "Sprintf: " . sprintf("%s",$ch) . "\n";
   print "-------\n";
}


showme("Cryillic direct" , "д");
showme("Cyrillic UTF-8"  , pack("HH","D0","B4"));  # UTF-8 of д is D0B4
showme("Cyrillic UCS-2"  , pack("HH","04","34"));  # UCS-2 of д is 0434

Current output:

Looks good

-------
This is test: Cryillic direct
Received string of length: 2 byte, contents: 208.180, ordinal x0434, utf-8: yes

д
Combine: д
Concat: д
Sprintf: д
-------

That's a no. Where does the 176 come from??

-------
This is test: Cyrillic UTF-8
Received string of length: 2 byte, contents: 208.176, ordinal x00D0, utf-8: no

а
Combine: а
Concat: а
Sprintf: а
-------

This is even worse.

-------
This is test: Cyrillic UCS-2
Received string of length: 2 byte, contents: 0.48, ordinal x0000, utf-8: no

0
Combine: 0
Concat: 0
Sprintf: 0
-------

Upvotes: 2

Views: 884

Answers (3)

David Tonhofer

Reputation: 15316

Both are good answers. Here is a slight extension of Polar Bear's code to print details about the string:

use strict;
use warnings;
use feature 'say';

use utf8;
use Encode;

sub about {
   my($str) = @_;
   # https://perldoc.perl.org/bytes.html
   my $charlen = length($str);
   my $txt;
   {
      use bytes;
      my $mark = (utf8::is_utf8($str) ? "yes" : "no");
      my $bytelen = length($str);
      $txt  = sprintf("Length: %d byte, %d chars, utf-8: %s, contents: %vd\n", 
                      $bytelen,$charlen,$mark,$str);
   }
   return $txt;
}

my $str;

my $utf8   = 'Привет Москва';
my $ucs2le = '1f044004380432043504420420001c043e0441043a0432043004';    # Little Endian
my $ucs2be = '041f044004380432043504420020041c043e0441043a04320430';    # Big Endian
my $utf16  = '041f044004380432043504420020041c043e0441043a04320430';
my $utf32  = '0000041f0000044000000438000004320000043500000442000000200000041c0000043e000004410000043a0000043200000430';

binmode STDOUT, ':utf8';

say 'UTF-8:   ' . $utf8;
say about($utf8);

{
   my $str = pack('H*',$ucs2be);
   say 'UCS-2BE: ' . decode('UCS-2BE',$str);
   say about($str);
}

{
   my $str = pack('H*',$ucs2le);
   say 'UCS-2LE: ' . decode('UCS-2LE',$str);
   say about($str);
}

{
   my $str = pack('H*',$utf16);
   say 'UTF-16:  '. decode('UTF16',$str);
   say about($str);
}

{
   my $str = pack('H*',$utf32);
   say  'UTF-32:  ' . decode('UTF32',$str);
   say about($str);
}

# Try identity transcoding

{
   my $str_encoded_in_utf16 = encode('UTF16',$utf8);
   my $str = decode('UTF16',$str_encoded_in_utf16);
   say 'The same: ' . $str;
   say about($str);
}

Running this gives:

UTF-8:   Привет Москва
Length: 25 byte, 13 chars, utf-8: yes, contents: 208.159.209.128.208.184.208.178.208.181.209.130.32.208.156.208.190.209.129.208.186.208.178.208.176

UCS-2BE: Привет Москва
Length: 26 byte, 26 chars, utf-8: no, contents: 4.31.4.64.4.56.4.50.4.53.4.66.0.32.4.28.4.62.4.65.4.58.4.50.4.48

UCS-2LE: Привет Москва
Length: 26 byte, 26 chars, utf-8: no, contents: 31.4.64.4.56.4.50.4.53.4.66.4.32.0.28.4.62.4.65.4.58.4.50.4.48.4

UTF-16:  Привет Москва
Length: 26 byte, 26 chars, utf-8: no, contents: 4.31.4.64.4.56.4.50.4.53.4.66.0.32.4.28.4.62.4.65.4.58.4.50.4.48

UTF-32:  Привет Москва
Length: 52 byte, 52 chars, utf-8: no, contents: 0.0.4.31.0.0.4.64.0.0.4.56.0.0.4.50.0.0.4.53.0.0.4.66.0.0.0.32.0.0.4.28.0.0.4.62.0.0.4.65.0.0.4.58.0.0.4.50.0.0.4.48

The same: Привет Москва
Length: 25 byte, 13 chars, utf-8: yes, contents: 208.159.209.128.208.184.208.178.208.181.209.130.32.208.156.208.190.209.129.208.186.208.178.208.176

And here is a little diagram I made as an overview of encode, decode and pack, so as to be better prepared next time.

[Diagram: perl_strings_and_encode_decode]

(The above diagram & its graphml file available here)

Upvotes: 1

Polar Bear

Reputation: 6798

Please see whether the following demonstration code is of any help.

use strict;
use warnings;
use feature 'say';

use utf8;     # https://perldoc.perl.org/utf8.html
use Encode;   # https://perldoc.perl.org/Encode.html

my $str;

my $utf8   = 'Привет Москва';
my $ucs2le = '1f044004380432043504420420001c043e0441043a0432043004';    # Little Endian
my $ucs2be = '041f044004380432043504420020041c043e0441043a04320430';    # Big Endian
my $utf16  = '041f044004380432043504420020041c043e0441043a04320430';
my $utf32  = '0000041f0000044000000438000004320000043500000442000000200000041c0000043e000004410000043a0000043200000430';

# https://perldoc.perl.org/functions/binmode.html

binmode STDOUT, ':utf8'; 

# https://perldoc.perl.org/feature.html#The-'say'-feature

say 'UTF-8:   ' . $utf8;  

# https://perldoc.perl.org/Encode.html#THE-PERL-ENCODING-API

$str = pack('H*',$ucs2be);
say 'UCS-2BE: ' . decode('UCS-2BE',$str);  

$str = pack('H*',$ucs2le);
say 'UCS-2LE: ' . decode('UCS-2LE',$str);

$str = pack('H*',$utf16);
say 'UTF-16:  '. decode('UTF16',$str);

$str = pack('H*',$utf32);
say 'UTF-32:  ' . decode('UTF32',$str);

Output

UTF-8:   Привет Москва
UCS-2BE: Привет Москва
UCS-2LE: Привет Москва
UTF-16:  Привет Москва
UTF-32:  Привет Москва

Supported Cyrillic encodings

use strict;
use warnings;
use feature 'say';

use Encode;
use utf8;

binmode STDOUT, ':utf8';

my $utf8 = 'Привет Москва';
my @encodings = qw/UCS-2 UCS-2LE UCS-2BE UTF-16 UTF-32 ISO-8859-5 CP855 CP1251 KOI8-F KOI8-R KOI8-U/;

say '
:: Supported Cyrillic encoding
---------------------------------------------
UTF-8       ', $utf8;

for (@encodings) {
    printf "%-11s %s\n", $_, unpack('H*', encode($_,$utf8));
}

Output

:: Supported Cyrillic encoding
---------------------------------------------
UTF-8       Привет Москва
UCS-2       041f044004380432043504420020041c043e0441043a04320430
UCS-2LE     1f044004380432043504420420001c043e0441043a0432043004
UCS-2BE     041f044004380432043504420020041c043e0441043a04320430
UTF-16      feff041f044004380432043504420020041c043e0441043a04320430
UTF-32      0000feff0000041f0000044000000438000004320000043500000442000000200000041c0000043e000004410000043a0000043200000430
ISO-8859-5  bfe0d8d2d5e220bcdee1dad2d0
CP855       dde1b7eba8e520d3d6e3c6eba0
CP1251      cff0e8e2e5f220cceef1eae2e0
KOI8-F      f0d2c9d7c5d420edcfd3cbd7c1
KOI8-R      f0d2c9d7c5d420edcfd3cbd7c1
KOI8-U      f0d2c9d7c5d420edcfd3cbd7c1

Documentation: Encode::Supported
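As a quick sanity check on the table above, here is a minimal sketch that encodes the same text into a few of those encodings and decodes it back. The selection of encodings and the %round_trip_ok hash are my own choices for illustration, and it assumes the standard Encode distribution that ships with Perl.

```perl
use strict;
use warnings;
use utf8;   # this source file contains UTF-8 literals
use Encode qw( encode decode find_encoding );

my $text = 'Привет Москва';
my %round_trip_ok;   # hypothetical name: encoding => 1 if decode(encode(x)) eq x

for my $name (qw( UTF-16 UTF-32 ISO-8859-5 CP1251 KOI8-R )) {
    # find_encoding returns false if Encode does not know the name
    die "Encoding $name is not available\n" unless find_encoding($name);

    my $bytes = encode($name, $text);           # decoded text -> bytes
    my $back  = decode($name, $bytes);          # bytes -> decoded text
    $round_trip_ok{$name} = ($back eq $text) ? 1 : 0;

    # Only ASCII is printed, so no output layer is needed here
    printf "%-11s %3d bytes, round trip %s\n",
        $name, length($bytes), $round_trip_ok{$name} ? "ok" : "FAILED";
}
```

The byte counts differ per encoding, but the decoded text compares equal to the original in each case.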

Upvotes: 1

ikegami

Reputation: 385556

You have two problems.


Your calls to pack are incorrect

Each H represents one hex digit.

$ perl -e'printf "%vX\n", pack("HH", "D0", "B4")'       # XXX
D0.B0

$ perl -e'printf "%vX\n", pack("H2H2", "D0", "B4")'     # Ok
D0.B4

$ perl -e'printf "%vX\n", pack("(H2)2", "D0", "B4")'    # Ok
D0.B4

$ perl -e'printf "%vX\n", pack("(H2)*", "D0", "B4")'    # Better
D0.B4

$ perl -e'printf "%vX\n", pack("H*", "D0B4")'           # Alternative
D0.B4
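This also explains the 176 in the question's output. A small sketch (the variable names are mine) that makes the missing low nybble visible:

```perl
use strict;
use warnings;

# "H" without a count consumes one hex digit (high nybble first);
# the low nybble of that byte is left as 0.
my $wrong = pack("HH", "D0", "B4");       # only "D" and "B" are used
my $right = pack("(H2)*", "D0", "B4");    # two digits per byte

my $wrong_hex = unpack("H*", $wrong);
my $right_hex = unpack("H*", $right);
my $second    = ord(substr($wrong, 1, 1));   # value of the second byte

print "HH    gives $wrong_hex, second byte is $second\n";
print "(H2)* gives $right_hex\n";
```

The second byte is 0xB0 (176) rather than 0xB4 (180) because the "4" was never read.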

STDOUT is expecting decoded text, but you are providing encoded text

First, let's take a look at the strings you are producing (once the problem mentioned above is fixed). All you need for that is the %vX format, which provides the period-separated value of each character in hex.

  • "д" produces a one-character string. This character is the Unicode Code Point for д.

    $ perl -e'use utf8; printf("%vX\n", "д");'
    434
    
  • pack("H*", "D0B4") produces a two-character string. These characters are the UTF-8 encoding of д.

    $ perl -e'printf("%vX\n", pack("H*", "D0B4"));'
    D0.B4
    
  • pack("H*", "0434") produces a two-character string. These characters are the UCS-2be and UTF-16be encodings of д.

    $ perl -e'printf("%vX\n", pack("H*", "0434"));'
    4.34
    

Normally, a file handle expects a string of bytes (characters with values in 0..255) to be printed to it. These bytes are output verbatim.[1][2]

When an encoding layer (e.g. :encoding(UTF-8)) is added to a file handle, it expects a string of Unicode Code Points (aka decoded text) to be printed to it instead.

Your program adds an encoding layer to STDOUT (through its use of the use open pragma), so you must provide UCP (decoded text) to print and say. You can obtain decoded text from encoded text using, for example, Encode's decode function.

use utf8;
use open qw( :std :encoding(UTF-8) );
use feature qw( say );

use Encode qw( decode );

say "д";                   # ok  (UCP of "д")
say pack("H*", "D0B4");    # XXX (UTF-8 encoding of "д")
say pack("H*", "0434");    # XXX (UCS-2be and UTF-16be encoding of "д")

say decode("UTF-8",    pack("H*", "D0B4"));   # ok (UCP of "д")
say decode("UCS-2be",  pack("H*", "0434"));   # ok (UCP of "д")
say decode("UTF-16be", pack("H*", "0434"));   # ok (UCP of "д")
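To see what actually reaches the file in each case, here is a minimal sketch that prints to in-memory handles and inspects the bytes written. The helper bytes_written is hypothetical and exists only for this illustration.

```perl
use strict;
use warnings;
use utf8;   # this source file contains UTF-8 literals

# Hypothetical helper: print $text to an in-memory handle opened with
# the given layer and return the bytes that were written, as hex.
sub bytes_written {
    my ($layer, $text) = @_;
    my $buf = '';
    open(my $fh, ">$layer", \$buf) or die "Cannot open in-memory handle: $!";
    print {$fh} $text;
    close($fh) or die "Cannot close in-memory handle: $!";
    return unpack("H*", $buf);
}

my $decoded = "д";                   # one character, U+0434
my $encoded = pack("H*", "D0B4");    # two bytes, the UTF-8 encoding of д

my $ok_layer = bytes_written(":encoding(UTF-8)", $decoded);  # decoded text, encoding layer
my $ok_raw   = bytes_written(":raw",             $encoded);  # bytes, no encoding layer
my $double   = bytes_written(":encoding(UTF-8)", $encoded);  # encoded text, encoding layer

print "decoded text through :encoding(UTF-8): $ok_layer\n";
print "encoded text through :raw:             $ok_raw\n";
print "encoded text through :encoding(UTF-8): $double\n";
```

The first two cases write the same two bytes. In the third, each byte of the already encoded string is treated as a character and encoded again, which is the double encoding behind the garbled output in the question.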

"For the UTF-8 case, I need to set the UTF-8 flag on"

No, you need to decode the strings.

The UTF-8 flag is irrelevant. Whether the flag is set or not originally is irrelevant. Whether the flag is set or not after the string is decoded is irrelevant. The flag indicates how the string is stored internally, something you shouldn't care about.

For example, take

use strict;
use warnings;
use open qw( :std :encoding(UTF-8) );
use feature qw( say );

my $x = chr(0xE9);

utf8::downgrade($x);   # Tell Perl to use the UTF8=0 storage format.
say sprintf "%s %vX %s", utf8::is_utf8($x) ? "UTF8=1" : "UTF8=0", $x, $x;

utf8::upgrade($x);   # Tell Perl to use the UTF8=1 storage format.
say sprintf "%s %vX %s", utf8::is_utf8($x) ? "UTF8=1" : "UTF8=0", $x, $x;

It outputs

UTF8=0 E9 é
UTF8=1 E9 é

Regardless of the UTF8 flag, the UTF-8 encoding (C3 A9) of the provided UCP (U+00E9) is output.
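The same point can be checked without printing anything. In this sketch (variable names are mine), two copies of the same one-character string are stored in the two internal formats and then compared:

```perl
use strict;
use warnings;
use Encode qw( encode );

my $down = chr(0xE9);
my $up   = chr(0xE9);
utf8::downgrade($down);   # UTF8=0 storage format
utf8::upgrade($up);       # UTF8=1 storage format

my $flags_differ = (utf8::is_utf8($down) ? 1 : 0) != (utf8::is_utf8($up) ? 1 : 0);
my $same_string  = $down eq $up;
my $same_length  = length($down) == length($up);
my $same_bytes   = encode("UTF-8", $down) eq encode("UTF-8", $up);

printf "flags differ: %s\n", $flags_differ ? "yes" : "no";
printf "strings equal: %s, lengths equal: %s, encoded bytes equal: %s\n",
    map { $_ ? "yes" : "no" } $same_string, $same_length, $same_bytes;
```

The flags differ, yet the strings compare equal, have the same length and encode to the same bytes.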


"I suppose it's because there is no way for Perl [to tell] UCS-2 from ISO-8859-1, so that test is probably bollocks, right?"

At best, one could employ heuristics to guess whether a string is encoded using iso-latin-1 or UCS-2be. I suspect one could get rather accurate results (like those you'd get for iso-latin-1 and UTF-8.)

I'm not sure why you bring up iso-latin-1 since nothing else in your question relates to iso-latin-1.
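For the iso-latin-1 versus UTF-8 case mentioned above, one common heuristic is to attempt a strict UTF-8 decode and fall back to iso-latin-1 if it fails. The following is only a sketch of that idea; guess_decode is a hypothetical helper, and the heuristic can guess wrong for short iso-latin-1 strings that happen to be valid UTF-8.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Hypothetical helper: returns (encoding name, decoded text).
# Assumption: the input is either UTF-8 or ISO-8859-1.
sub guess_decode {
    my ($bytes) = @_;

    # FB_CROAK makes decode die on malformed input;
    # LEAVE_SRC keeps $bytes intact for the fallback.
    my $text = eval {
        decode("UTF-8", $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return ("UTF-8", $text) if defined $text;

    # Every byte sequence is valid ISO-8859-1, so this cannot fail.
    return ("ISO-8859-1", decode("ISO-8859-1", $bytes));
}

my ($enc1, $text1) = guess_decode(pack("H*", "D0B4"));   # valid UTF-8
my ($enc2, $text2) = guess_decode(pack("H*", "E9"));     # a lone E9 is not valid UTF-8

printf "%-10s U+%04X\n", $enc1, ord($text1);
printf "%-10s U+%04X\n", $enc2, ord($text2);
```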


  1. Except on Windows, where a :crlf layer is added to handles by default.

  2. You get a Wide character warning if you provide a string that contains a character that's not a byte, and the utf8 encoding of the string is output instead.
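Footnote 2 can be reproduced with a handle that has no encoding layer. This sketch captures the warning instead of letting it go to STDERR; it assumes no default layers are imposed through use open, -C or the environment.

```perl
use strict;
use warnings;

my @warnings;
my $buf = '';
{
    # Collect warnings locally instead of printing them
    local $SIG{__WARN__} = sub { push @warnings, $_[0] };

    open(my $fh, '>', \$buf) or die "Cannot open in-memory handle: $!";
    print {$fh} "\x{434}";    # a character that is not a byte
    close($fh) or die "Cannot close in-memory handle: $!";
}

my $hex = unpack("H*", $buf);
print "bytes written: $hex\n";
print "warning: $_" for @warnings;
```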

Upvotes: 4
