Matheus Monteiro

Reputation: 19

Converting UTF-8 hex to Unicode hex

I have strings such as "flamenguistas e s\xc3a3o paulinos", which contain UTF-8 byte sequences written as hex escapes like "\xc3a3". How do I turn this into the letter "ã"?

I'm having trouble because most un-escaping functions expect Unicode code points, and I haven't found a proper way to convert the UTF-8 hex to the Unicode hex.

Is there an easy way to transform UTF-8 hex to Unicode hex aside from writing a function reading from a table and converting?

P.S. When I say "Unicode hex"/"UTF-8 hex" I mean as in here: https://en.wikipedia.org/wiki/%C3%87#Computer

Upvotes: 0

Views: 616

Answers (1)

daxim

Reputation: 39158

It looks like R supports PCRE regular expressions, so you can port the following Perl substitution.

The hex function converts a string of hex digits into a number. The chr function turns a number into a character. The dot operator (.) concatenates strings. The result is a string of UTF-8 encoded octets.

#!/usr/bin/env perl
$_ = <<'';
flamenguistas e s\xc3a3o paulinos

s|
    \\x             # literal \x
    (               # capture into $1
        [0-9a-f]    # hex digits
        {2}         # exactly two times
    )
    (               # capture into $2
        [0-9a-f]
        {2}
    )
|
    chr(hex($1)) . chr(hex($2))
|egmsx;

print; # flamenguistas e são paulinos
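
The substitution above yields raw UTF-8 octets. If what you want is the Unicode code point itself (the "Unicode hex" from the question), one possible approach is to decode those octets with the core Encode module and then call ord on the resulting character. The following is only a sketch: the helper names utf8_hex_to_char and unescape_utf8_hex are made up for illustration, and it assumes every escape is exactly four hex digits (a two-byte UTF-8 sequence), as in the regex above.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: decode the UTF-8 octets written as hex digits
# (e.g. "c3a3") into a single Perl character.
sub utf8_hex_to_char {
    my ($hex) = @_;
    my $octets = pack('H*', $hex);    # hex digits -> raw bytes 0xC3 0xA3
    return decode('UTF-8', $octets);  # raw bytes  -> decoded character
}

# Hypothetical helper: replace every \xHHHH escape in a string.
# Assumption: each escape is exactly four hex digits (two octets).
sub unescape_utf8_hex {
    my ($text) = @_;
    $text =~ s/\\x([0-9a-fA-F]{4})/utf8_hex_to_char($1)/ge;
    return $text;
}

# Code point of the decoded character.
printf "U+%04X\n", ord(utf8_hex_to_char('c3a3'));    # U+00E3

# Decode the whole string, then re-encode it as UTF-8 for output.
my $fixed = unescape_utf8_hex('flamenguistas e s\xc3a3o paulinos');
print encode('UTF-8', $fixed), "\n";
```

Characters that need three or four UTF-8 octets would need a longer pattern, and a variable-length pattern becomes ambiguous when an escape is directly followed by text that happens to consist of hex digits.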

Upvotes: 1
