Dekel


Convert `\195\164` to `u'\xe4'` - result from DNS resolver back to unicode

Resolving a Unicode hostname via DNS returns the following:

'\195\164\195\182\195\188o.mydomain104.local.'

The \195\164 is actually the UTF-8 encoding (decimal bytes 195 and 164) of the following unicode letter: ä (u'\xe4'), the lowercase form of Ä (u'\xc4').

The original hostname is:

ÄÖÜO.mydomain104.local

I'm looking for a way to convert it back to the unicode string (in Python 2.7).

In case the original code is needed, it's something like the following:

from dns import resolver, reversename
from dns.exception import DNSException

def get_name(ip_address):
    answer = None
    res = resolver.Resolver()
    addr = reversename.from_address(ip_address)
    try:
        answer = res.query(addr, "PTR")[0].to_text().decode("utf-8")
    except DNSException:
        pass
    return answer

I looked at both .encode and .decode, the unicodedata library, and codecs, but found nothing that worked.

Upvotes: 0


Answers (1)

unutbu


Clue #1:

In [1]: print(b'\xc3\xa4\xc3\xb6\xc3\xbc'.decode('utf_8'))
äöü

In [2]: print(bytearray([195,164,195,182,195,188]).decode('utf-8'))
äöü

Clue #2: Per the docs, Python interprets \ooo as the character with octal value ooo, and \xhh as the character with hex value hh.

Since 9 is not a valid octal digit, '\195' is interpreted as '\1' followed by the literal characters '95'.
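
A quick way to check this parsing, as a small illustrative sketch (the names `s` and `chars` are only for illustration):

```python
# '\195' is parsed as the octal escape '\1' followed by the
# literal characters '9' and '5', so the string has three characters.
s = '\195'
chars = list(s)
print(len(s))  # 3
print(chars)   # ['\x01', '9', '5']
```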

hex(195) is '0xc3', so instead of '\195' we want '\xc3'. We need to convert the decimal number after each backslash into the form \xhh.


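In Python 2:

import re

# Raw string, so each backslash is a literal character, as in the resolver output.
given = r'\195\164\195\182\195\188o.mydomain104.local.'
# print(list(given))
# Rewrite each three-digit decimal escape (\DDD) as a two-digit hex escape (\xhh).
decimals_to_hex = re.sub(r'\\(\d{3})', lambda match: '\\x{:02x}'.format(int(match.group(1))), given)
# print(list(decimals_to_hex))
# Interpret the \xhh escapes, producing the raw UTF-8 bytes.
result = decimals_to_hex.decode('string_escape')
print(result)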

prints

äöüo.mydomain104.local.

In Python 3, use codecs.escape_decode instead of decode('string_escape'):

import re
import codecs

given = rb'\195\164\195\182\195\188o.mydomain104.local.'

# Rewrite each three-digit decimal escape (\DDD) as a two-digit hex escape (\xhh).
decimals_to_hex = re.sub(rb'\\(\d{3})',
    lambda match: '\\x{:02x}'.format(int(match.group(1))).encode('ascii'), given)
# escape_decode returns a (bytes, length) tuple; decode the bytes as UTF-8.
print(codecs.escape_decode(decimals_to_hex)[0].decode('utf-8'))

prints the same result.
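
Since codecs.escape_decode is not part of the documented codecs API, here is an alternative sketch for Python 3 that skips the intermediate \xhh form and replaces each escape directly with the byte it stands for. It assumes every escaped byte appears as a backslash followed by exactly three decimal digits, as in the example above; the function name `unescape_dns_text` is hypothetical, not part of any library.

```python
import re

def unescape_dns_text(escaped):
    # Replace each backslash + three decimal digits with the single byte
    # of that value, then decode the resulting bytes as UTF-8.
    # Assumption: each value is in the range 0..255 (bytes() raises
    # ValueError otherwise).
    raw = re.sub(
        rb'\\(\d{3})',
        lambda m: bytes([int(m.group(1))]),
        escaped,
    )
    return raw.decode('utf-8')

given = rb'\195\164\195\182\195\188o.mydomain104.local.'
print(unescape_dns_text(given))  # äöüo.mydomain104.local.
```

The digits in "mydomain104" are left alone because they are not preceded by a backslash.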

Upvotes: 4
