How to convert actual Unicode to \u0123

Question

I want to turn Unicode text into pure ASCII encoding using escape sequences.

Input: Ɏɇ衳 should produce "\u024E\u0247\u8873"

Basically the opposite of this.

$ echo -e "\u024E\u0247\u8873"
Ɏɇ衳

I want the encoding to stay UTF-8; all I'm doing is changing the form.

I've tried:

iconv -f utf8 -t utf8  $file
iconv -f utf8 -t utf16  $file

tshiono · Accepted Answer

The values you mention (024E, 0247, ...) are Unicode code points, which are independent of the encoding (UTF-8, UTF-16, and so on).
If Perl is an option, you can print the code points with:

perl -C -ne 'map {printf "\\u%04X", ord} (/./g)' <<< "Ɏɇ衳"; echo

which outputs:

\u024E\u0247\u8873

Explanation

The perl code above is mostly equivalent to:

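#!/usr/bin/perl

use strict;
use warnings;
use utf8;    # the source below contains UTF-8 literals

my $str = "Ɏɇ衳";
foreach my $chr ($str =~ /./g) {
    # "\\u" is a literal backslash followed by u
    printf "\\u%04X", ord($chr);
}
print "\n";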

use utf8 tells perl that the script source itself is UTF-8 encoded; it is needed here only because the string is embedded in the script.
($str =~ /./g) breaks the string into a list of characters.
foreach iterates over that list.
ord returns the code point of the given character.

EDIT

If you want the number of digits to scale for characters outside the BMP (code points above U+FFFF), try this instead:

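#!/usr/bin/perl

use strict;
use warnings;
use utf8;    # the source below contains UTF-8 literals

my $str = "Ɏɇ衳";
foreach my $chr ($str =~ /./g) {
    my $n = ord($chr);
    my $d = $n > 0xffff ? 8 : 4;    # 8 digits outside the BMP, else 4
    printf "\\u%0${d}X", $n;
}
print "\n";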
