700 Software

Reputation: 87773

Reading Unicode chars at the byte level

Suppose I wanted to detect Unicode characters and encode them using \u notation. If I had to work from a byte array, are there simple rules I can follow to detect the groups of bytes that belong to a single character?

I am referring to UTF-8 bytes that need to be encoded for an ASCII-only receiver. At the moment, anything other than printable ASCII, CR, LF, and TAB is stripped with s/[^\x20-\x7e\r\n\t]//g.

I want to improve this so that those characters are written in \u0000 notation instead of being dropped.

Upvotes: 1

Views: 364

Answers (1)

ikegami

Reputation: 385546

You need Unicode characters rather than UTF-8 bytes, so start by decoding your byte array. The decoder takes care of grouping the bytes that belong to each character.

use Encode qw( decode );
my $decoded_text = decode("UTF-8", $encoded_text);
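
By default, decode replaces malformed sequences with U+FFFD instead of failing. If the input might not be valid UTF-8, a stricter variant is possible. The following is only a minimal sketch, assuming you would rather reject bad input than escape replacement characters; the helper name decode_strict is hypothetical, while Encode::FB_CROAK and Encode::LEAVE_SRC are check flags documented by Encode.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Hypothetical helper: decode UTF-8 bytes, returning undef on malformed input.
sub decode_strict {
    my ($bytes) = @_;
    # FB_CROAK makes decode() die on malformed input;
    # LEAVE_SRC keeps decode() from modifying $bytes in place.
    my $text = eval {
        decode("UTF-8", $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
    };
    return $text;    # undef if decode() died
}
```

A caller can then test defined(decode_strict($encoded_text)) before escaping.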

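Only then can you escape Unicode characters. The substitution below leaves LF and printable ASCII untouched and escapes everything else, including CR, TAB, and the backslash itself (\x5C), so the output stays unambiguous.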

( my $escaped_text = $decoded_text ) =~
   s/([^\x0A\x20-\x5B\x5D-\x7E])/sprintf("\\u%04X", ord($1))/eg;

For example,

$ perl -CSDA -MEncode=decode -E'
   my $encoded_text = "\xC3\x89\x72\x69\x63\x20\xE2\x99\xA5\x20\x50\x65\x72\x6c";
   my $decoded_text = decode("UTF-8", $encoded_text);
   say $decoded_text;
   ( my $escaped_text = $decoded_text ) =~
      s/([^\x0A\x20-\x5B\x5D-\x7E])/sprintf("\\u%04X", ord($1))/eg;
   say $escaped_text;
'
Éric ♥ Perl
\u00C9ric \u2665 Perl
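
One caveat: %04X yields five or six hex digits for code points above U+FFFF (for example U+1F600), which a receiver expecting exactly four digits after \u will misread. The following is a minimal sketch, assuming the receiver follows the JSON/JavaScript convention of writing such characters as a UTF-16 surrogate pair. The helper name escape_non_ascii is hypothetical, and a receiver with a different convention would need a different format.

```perl
use strict;
use warnings;

# Hypothetical helper. Takes decoded text (characters, not UTF-8 bytes)
# and escapes everything except LF and printable ASCII other than "\".
sub escape_non_ascii {
    my ($text) = @_;
    $text =~ s{([^\x0A\x20-\x5B\x5D-\x7E])}{
        my $cp = ord($1);
        $cp > 0xFFFF
            ? sprintf("\\u%04X\\u%04X",
                      0xD800 + (($cp - 0x10000) >> 10),
                      0xDC00 + (($cp - 0x10000) & 0x3FF))
            : sprintf("\\u%04X", $cp);
    }eg;
    return $text;
}

# Assumption: surrogate pairs are what the receiver expects.
print escape_non_ascii("\x{C9}ric \x{2665} \x{1F600}"), "\n";
# prints: \u00C9ric \u2665 \uD83D\uDE00
```

Characters up to U+FFFF come out exactly as in the substitution above; only the supplementary-plane case differs.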

Upvotes: 2
