sebix

Reputation: 3239

Decoding Unicode strings with base-10 integer escape sequences

As input I get strings like this:

Funda\195\131\194\167\195\131\194\163o

which I get from the Cymru Whois service:

$ dig +short AS10417.asn.cymru.com TXT
"10417 | BR | lacnic | 2000-02-15 | Funda\195\131\194\167\195\131\194\163o de Desenvolvimento da Pesquisa, BR"

Correctly decoded that would result in:

Fundação

In hexadecimal notation, the two accented characters are:

b'\xc3\xa7\xc3\xa3'

where 0xc3 is 195, 0xa7 is 167 and 0xa3 is 163, matching the numbers of the first and last character of each quadruple.
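A quick check of that arithmetic, as a minimal sketch (the names `target`, `quads` and `outer` are only for illustration):

```python
# UTF-8 bytes of the two accented characters in "Fundação"
target = 'çã'.encode('utf-8')
print(target)        # b'\xc3\xa7\xc3\xa3'
print(list(target))  # [195, 167, 195, 163]

# The decimal escapes from the input, grouped into quadruples
quads = [(195, 131, 194, 167), (195, 131, 194, 163)]

# Keep only the first and last number of each quadruple
outer = [n for q in quads for n in (q[0], q[-1])]
print(bytes(outer) == target)  # True
```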

So \195\131\194\167 is ç and \195\131\194\163 is ã. It looks like Python cannot decode that, at least with the default parameters.

Is this kind of encoding common and is there any built-in functionality in Python to decode this generically (not specific to this string of course)?

Upvotes: 0

Views: 257

Answers (1)

Jongware

Reputation: 22478

The trick here is to use a custom replacement routine in re.sub:

If repl is a function, it is called for every non-overlapping occurrence of pattern. The function takes a single match object argument, and returns the replacement string.

To convert the string to printable characters, encode it to bytes using latin1, which preserves all literal byte codes, then decode it as UTF-8:

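import re

text = r'Funda\195\131\194\167\195\131\194\163o'

# This is our target: the UTF-8 bytes of the correctly decoded string
print('Fundação'.encode('utf-8'))

# Replace each \NNN escape with the character of that code point,
# map the characters back to bytes with latin1, then decode as UTF-8
print(re.sub(r'\\(\d+)', lambda m: chr(int(m.group(1))), text).encode('latin1').decode('utf-8'))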

However, your text is not simply UTF-8 encoded but double encoded! The code above prints:

b'Funda\xc3\xa7\xc3\xa3o'
FundaÃ§Ã£o

So decoding it as UTF-8 once yields another UTF-8 encoded string, shown here as latin1 characters. We need to decode twice:

# Undo the escapes and the first layer of UTF-8 encoding
once = re.sub(r'\\(\d+)', lambda m: chr(int(m.group(1))), text).encode('latin1').decode('utf-8')

# Print the byte values so you can compare them to the UTF-8 target
print(once.encode('latin1'))

# Decode the second layer to get the final text
print(once.encode('latin1').decode('utf-8'))

to finally get the output:

b'Funda\xc3\xa7\xc3\xa3o'
Fundação
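To make this reusable rather than specific to one string, the steps can be wrapped in a small function. This is only a sketch: `decode_decimal_escapes` and its `layers` parameter are hypothetical names, it assumes every escaped number is in the range 0-255 and that all other characters in the input are plain ASCII or latin1, and it does not detect how many times the text was encoded, so you have to pass that in. As far as I know, the \DDD form is the decimal octet escape used when DNS records are printed in zone-file format, which would explain why dig shows it.

```python
import re

def decode_decimal_escapes(text: str, layers: int = 2) -> str:
    """Hypothetical helper: undo decimal escapes, then `layers` rounds of UTF-8."""
    # Turn each backslash-plus-digits escape into the character with that
    # code point (values 0-255 are assumed; larger ones fail in latin1 below)
    result = re.sub(r'\\(\d{1,3})', lambda m: chr(int(m.group(1))), text)
    for _ in range(layers):
        # latin1 maps code points 0-255 one-to-one onto bytes,
        # so this recovers the raw bytes before decoding them as UTF-8
        result = result.encode('latin1').decode('utf-8')
    return result

text = r'Funda\195\131\194\167\195\131\194\163o de Desenvolvimento da Pesquisa, BR'
print(decode_decimal_escapes(text))
```

For text that was UTF-8 encoded only once, call it with layers=1; with the wrong number of layers the result is either mojibake or a UnicodeDecodeError.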

Upvotes: 3
