Caucasian Malaysian

Reputation: 345

How to convert actual Unicode to \u0123

I want to turn Unicode text into pure ASCII using escape sequences.

For example, the input Ɏɇ衳 should produce "\u024E\u0247\u8873".

Basically the opposite of this:

$ echo -e "\u024E\u0247\u8873"
Ɏɇ衳

I want the encoding to stay UTF-8; all I'm doing is changing the representation.

I've tried:

iconv -f utf8 -t utf8 "$file"
iconv -f utf8 -t utf16 "$file"

Upvotes: 0

Views: 326

Answers (2)

tshiono

Reputation: 22012

The values you mention (024E, 0247, ...) are Unicode code points, and they are independent of the encoding (UTF-8 or UTF-16).
If Perl is an option, you can print the code points with:

perl -C -ne 'map {printf "\\u%04X", ord} (/./g)' <<< "Ɏɇ衳"; echo

which outputs:

\u024E\u0247\u8873

Explanation

The perl code above is mostly equivalent to:

#!/usr/bin/perl

use strict;
use warnings;
use utf8;    # the script source contains a UTF-8 string literal

my $str = "Ɏɇ衳";
foreach my $chr ($str =~ /./g) {
    printf "\\u%04X", ord($chr);    # code point as 4 uppercase hex digits
}
print "\n";

  • use utf8 tells perl that the script source is encoded in UTF-8 (needed only because the string is embedded in the script).
  • ($str =~ /./g) breaks the string into a list of characters.
  • foreach iterates over that list of characters.
  • ord returns the code point of the given character.

EDIT

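If you want to widen the escape for characters outside the BMP (code points above U+FFFF), try this instead. Note that bash's echo -e reads \u with up to 4 hex digits and \U with up to 8, so the 8-digit form uses a capital \U:

#!/usr/bin/perl

use strict;
use warnings;
use utf8;

my $str = "Ɏɇ衳";
foreach my $chr ($str =~ /./g) {
    my $n = ord($chr);
    if ($n > 0xffff) {
        printf "\\U%08X", $n;    # outside the BMP: 8 hex digits
    } else {
        printf "\\u%04X", $n;    # inside the BMP: 4 hex digits
    }
}
print "\n";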

Upvotes: 2

Lili Sousa

Reputation: 81

If the text is in a file, you can use iconv:

iconv -f "$input_encoding" -t "$output_encoding" "$file"

See "man iconv" for more details.

Upvotes: -1
