Reputation: 87773
Suppose I wanted to detect unicode characters and encode them using \u
notation. If I had to use a byte array, are there simple rules I can follow to detect groups of bytes that belong to a single character?
I am referring to UTF-8 bytes that need to be encoded for an ASCII-only receiver. At the moment, non-ASCII-Printable characters are stripped. s/[^\x20-\x7e\r\n\t]//g
.
I want to improve this functionality to write \u0000
notation.
Upvotes: 1
Views: 364
Reputation: 385546
You need to have Unicode characters, so start by decoding your byte array.
use Encode qw( decode );
my $decoded_text = decode("UTF-8", $encoded_text);
Only then can you escape Unicode characters.
( my $escaped_text = $decoded_text ) =~
s/([^\x0A\x20-\x5B\x5D-\x7E])/sprintf("\\u%04X", ord($1))/eg;
For example,
$ perl -CSDA -MEncode=decode -E'
my $encoded_text = "\xC3\x89\x72\x69\x63\x20\xE2\x99\xA5\x20\x50\x65\x72\x6c";
my $decoded_text = decode("UTF-8", $encoded_text);
say $decoded_text;
( my $escaped_text = $decoded_text ) =~
s/([^\x0A\x20-\x5B\x5D-\x7E])/sprintf("\\u%04X", ord($1))/eg;
say $escaped_text;
'
Éric ♥ Perl
\u00C9ric \u2665 Perl
Upvotes: 2