Reputation: 3239
As input I get strings like this
Funda\195\131\194\167\195\131\194\163o
which I get from the Cymru Whois service:
$ dig +short AS10417.asn.cymru.com TXT
"10417 | BR | lacnic | 2000-02-15 | Funda\195\131\194\167\195\131\194\163o de Desenvolvimento da Pesquisa, BR"
Correctly decoded that would result in:
Fundação
In hexadecimal notation, the UTF-8 bytes of çã are:
b'\xc3\xa7\xc3\xa3'
where 0xc3 is 195, 0xa7 is 167 and 0xa3 is 163, matching the first and last numbers of each quadruple.
So \195\131\194\167 is ç and \195\131\194\163 is ã. It looks like Python cannot decode that, at least with the default parameters.
Is this kind of encoding common and is there any built-in functionality in Python to decode this generically (not specific to this string of course)?
Upvotes: 0
Views: 257
Reputation: 22478
The trick here is to use a custom replacement routine in re.sub:
If repl is a function, it is called for every non-overlapping occurrence of pattern. The function takes a single match object argument, and returns the replacement string.
To convert the string to printable characters, encode it to bytes using latin1, which preserves all literal byte codes, then decode it as UTF-8:
import re
text = r'Funda\195\131\194\167\195\131\194\163o'
print(bytes('Fundação', 'utf8'))  # This is our target
# Replace each \NNN escape with the character of that code, then reinterpret the bytes as UTF-8
print(re.sub(r'\\(\d+)', lambda x: chr(int(x.group(1))), text).encode('latin1').decode('utf-8'))
However, your text is not simply UTF-8 encoded but double encoded! The output is:
b'Funda\xc3\xa7\xc3\xa3o'
FundaÃ§Ã£o
so decoding it as UTF-8 yields another UTF-8 encoded string. We need to translate twice:
# This first line prints the byte values so you can compare them to the UTF-8 target:
print(re.sub(r'\\(\d+)', lambda x: chr(int(x.group(1))), text).encode('latin1').decode('utf-8').encode('latin1'))
# Decoding those bytes as UTF-8 once more gives the readable text:
print(re.sub(r'\\(\d+)', lambda x: chr(int(x.group(1))), text).encode('latin1').decode('utf-8').encode('latin1').decode('utf8'))
to finally get the output:
b'Funda\xc3\xa7\xc3\xa3o'
Fundação
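If you want something more generic than a one-off expression, the two steps could be wrapped roughly as below. This is only a sketch under some assumptions: the helper names unescape_decimal and repair_mojibake are made up for illustration, every escape is assumed to be exactly three decimal digits (so a digit following an escape is not swallowed, unlike with \d+), all other characters are assumed to fit in Latin-1, and the "peel off layers until it fails" loop is a heuristic that could in principle over-correct text that legitimately contains sequences like Ã§. The first lines also show where the quadruples come from: UTF-8 bytes misread as Latin-1 and encoded as UTF-8 a second time.

```python
import re

# How the double encoding arises: UTF-8 bytes misread as Latin-1, then re-encoded.
once = "ç".encode("utf-8")                       # b'\xc3\xa7'
twice = once.decode("latin-1").encode("utf-8")   # b'\xc3\x83\xc2\xa7'
print(list(twice))                               # [195, 131, 194, 167]


def unescape_decimal(text: str) -> bytes:
    """Turn backslash-decimal escapes such as \\195 into raw bytes.

    Assumption: each escape has exactly three decimal digits and the
    remaining characters are representable in Latin-1.
    """
    raw = text.encode("latin-1")
    return re.sub(rb"\\(\d{3})", lambda m: bytes([int(m.group(1))]), raw)


def repair_mojibake(data: bytes, max_rounds: int = 3) -> str:
    """Decode UTF-8 and undo repeated 'UTF-8 read as Latin-1' layers (heuristic)."""
    decoded = data.decode("utf-8")
    for _ in range(max_rounds):
        try:
            candidate = decoded.encode("latin-1").decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            break  # no further layer can be peeled off
        if candidate == decoded:
            break  # pure ASCII, nothing left to undo
        decoded = candidate
    return decoded


record = r"Funda\195\131\194\167\195\131\194\163o de Desenvolvimento da Pesquisa, BR"
print(repair_mojibake(unescape_decimal(record)))
# Fundação de Desenvolvimento da Pesquisa, BR
```

Converting the escapes straight to bytes avoids the chr()/encode('latin1') detour, and the loop stops by itself once the text no longer round-trips, so singly encoded and plain ASCII records pass through unchanged.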
Upvotes: 3