Reputation: 345
I want to turn Unicode text into pure ASCII encoding using escape sequences.
Input: Ɏɇ衳
Output: "\u024E\u0247\u8873"
Basically the opposite of this.
$ echo -e "\u024E\u0247\u8873"
Ɏɇ衳
I want the encoding to stay in UTF-8; all I'm doing is changing forms.
iconv -f utf8 -t utf8 $file
iconv -f utf8 -t utf16 $file
Upvotes: 0
Views: 326
Reputation: 22012
The codes you mention (024E, 0247, ...) are called Unicode code points, and they are independent of UTF-8 or UTF-16.
If Perl is an option, you can retrieve the code points with:
perl -C -ne 'map {printf "\\u%04X", ord} (/./g)' <<< "Ɏɇ衳"; echo
which outputs:
\u024E\u0247\u8873
Explanation
The perl code above is mostly equivalent to:
#!/usr/bin/perl
use utf8;
$str = "Ɏɇ衳";
foreach $chr ($str =~ /./g) {
printf "\\u%04X", ord($chr);
}
print "\n";
use utf8 tells Perl that the script source is encoded in UTF-8 (needed only because the string is embedded in the script).
($str =~ /./g) breaks the string into a list of characters.
foreach iterates over that list of characters.
ord returns the code point of the given character.
EDIT
If you want the number of digits to scale automatically for characters outside the BMP (code points above U+FFFF), try this instead:
#!/usr/bin/perl
use utf8;
$str = "Ɏɇ衳";
foreach $chr ($str =~ /./g) {
$n = ord($chr);
$d = $n > 0xffff ? 8 : 4;
printf "\\u%0${d}X", $n;
}
Upvotes: 2
Reputation: 81
If the text is in a file, you can use iconv:
iconv -f $input_encoding -t $output_encoding $file
Check "man iconv" for more details.
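Note that iconv on its own only converts between byte encodings; it does not emit escape sequences. A rough sketch of how it could still be pressed into service: convert to UTF-16BE, dump the bytes as hex, and prefix every 4 hex digits with \u. The function name utf16_escapes is hypothetical, and the output differs from the Perl answer in that hex digits are lowercase and characters outside the BMP come out as UTF-16 surrogate pairs (the form JSON and Java use), not as a single code point.

```shell
# utf16_escapes (hypothetical name): UTF-8 on stdin -> \uXXXX per UTF-16 code unit.
utf16_escapes() {
  local hex
  # UTF-16BE stores each code unit as 2 bytes, high byte first,
  # so the hex dump is the code units written back to back.
  hex=$(iconv -f UTF-8 -t UTF-16BE | od -An -v -tx1 | tr -d ' \n')
  # Prefix every group of 4 hex digits with a literal \u.
  printf '%s\n' "$hex" | sed 's/..../\\u&/g'
}

# Example call (no trailing newline in the input, so none is escaped).
printf '%s' 'Ɏɇ衳' | utf16_escapes
```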
Upvotes: -1