Reputation: 19
I have strings such as "flamenguistas e s\xc3a3o paulinos", containing UTF-8 codes in this manner "\xc3a3". How do I turn this into the letter "ã"?
I'm having trouble because most un-escaping functions expect the codes to be Unicode code points, and I haven't found a proper way to convert the UTF-8 hex to the Unicode hex.
Is there an easy way to transform UTF-8 hex to Unicode hex aside from writing a function reading from a table and converting?
P.S. When I say "Unicode hex"/"UTF-8 hex" I mean as in here: https://en.wikipedia.org/wiki/%C3%87#Computer
Upvotes: 0
Views: 616
Reputation: 39158
It looks like R has support for PCRE regex. You can port the following substitution.
In Perl, the hex function takes a string of hex digits and converts it into a number. The chr function takes a number and turns it into a character. The dot operator is string concatenation. The whole result consists of UTF-8 encoded octets.
#!/usr/bin/env perl
# Single-quoted here-doc, so the backslash stays literal.
$_ = <<'END';
flamenguistas e s\xc3a3o paulinos
END
s|
    \\x             # literal \x
    (               # capture into $1
        [0-9a-f]    # hex digits
        {2}         # exactly two times
    )
    (               # capture into $2
        [0-9a-f]
        {2}
    )
|
    chr(hex($1)) . chr(hex($2))
|egmsx;
print; # flamenguistas e são paulinos
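For readers who want to see the same idea outside Perl, here is a minimal sketch in Python (not R, and not part of the original answer). It assumes, like the substitution above, that every escape is a literal \x followed by exactly four lowercase hex digits, i.e. one two-byte UTF-8 sequence. The function name unescape_utf8_hex is hypothetical. The last line also shows the "Unicode hex" the question asks about: decoding the bytes C3 A3 as UTF-8 gives the code point U+00E3.

```python
import re

def unescape_utf8_hex(text):
    """Replace each \\xHHHH escape with the character its two UTF-8 bytes encode."""
    def decode(match):
        # Turn the two captured hex pairs into two raw bytes, e.g. b"\xc3\xa3".
        raw = bytes([int(match.group(1), 16), int(match.group(2), 16)])
        # Interpret those bytes as UTF-8; raises UnicodeDecodeError if they are invalid.
        return raw.decode("utf-8")
    # Literal backslash, "x", then two groups of two hex digits (as in the Perl pattern).
    return re.sub(r"\\x([0-9a-f]{2})([0-9a-f]{2})", decode, text)

s = r"flamenguistas e s\xc3a3o paulinos"
result = unescape_utf8_hex(s)
print(result)  # flamenguistas e são paulinos

# The Unicode code point behind the UTF-8 bytes C3 A3:
code_point = ord(bytes.fromhex("c3a3").decode("utf-8"))
print(f"U+{code_point:04X}")  # U+00E3
```

Escapes that encode characters with one, three, or four UTF-8 bytes would need a different pattern; this sketch only mirrors the two-byte case handled above.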
Upvotes: 1