Implementing DNS message NAME compression algorithm in Python

Question

In DNS messages, domain names are represented like so:

Address    Value
----------------------------------------------------------------
0x00C      0x5 s t o r e 0x7 e x a m p l e 0x3 c o m 0x0
0x023      0x4 b l o g  0xC012

At address 0x00C we have store.example.com and at address 0x023 we have blog.example.com. 0xC012 represents a pointer to 0x012 where example.com begins in the first domain name, so we don't have to repeat it.

I managed to turn a domain name into an array of labels using the following code:

labels = []
for part in self.domain.split('.'):
    label = part.encode('ascii')
    length = len(label).to_bytes(1, 'big')
    labels.append(length + label)

This gives me the following array:

[b'\x04blog', b'\x07example', b'\x03com']

But I have trouble building a regular expression or an efficient function that would search the part of the message that's already written for either \x04blog\x07example\x03com\x00 or \x07example\x03com\x00 or \x03com00. I'd like to find the longest match I can. I tried building a regex like this:

(((\x04blog)?\x07example)?\x03com)?\x00

But it seems the regex engine prefers just matching for \x00 in that case. How do I fix that?

There's also another issue: we could technically add my.blog.example.com which would need to contain 0x2 m y 0xC023 with 0xC023 pointing to blog.example.com.

Patrick Mevzek · Accepted Answer

First sorry for the false hint I gave you, there are far simpler things to do. In fact there are far simpler things to do than using a regex here. regex are far too powerful to do this, you can achieve the same result without.

DNS message format is defined in RFC 1035 : "DOMAIN NAMES - IMPLEMENTATION AND SPECIFICATION"

Name compression is explained in section 4.1.4. Note also that you have to take into account the case insensitive aspect, as outlined in section 2.3.3

It should be the case in the following code that can surely be optimized, with example names taken from RFC:

#!/usr/bin/python3

names=['F.ISI.ARPA', 'FOO.F.ISI.ARPA', 'ARPA']
cache={}
pos=0
results=[]

for name in names:
    labels = name.split('.')
    while len(labels):
        key = '.'.join(labels)
        if key.lower() in cache:
            results.append((cache[key.lower()] + 2**15 + 2**14).to_bytes(2, 'big'))
            pos += 2
            break
        else:
            label = labels.pop(0).encode('ascii')
            length = len(label).to_bytes(1, 'big')
            results.append(length + label)
            cache[key.lower()] = pos
            pos += len(label) + 1
    else:
        results.append(b'0')
        pos += 1



print(results)

which produces:

[b'\x01F', b'\x03ISI', b'\x04ARPA', b'0', b'\x03FOO', b'\xc0\x00', b'\xc0\x06']

If instead using your example:

names=['store.example.com', 'blog.example.com', 'my.blog.example.com']

it produces:

[b'\x05store', b'\x07example', b'\x03com', b'0', b'\x04blog', b'\xc0\x06', b'\x02my', b'\xc0\x13']

Implementing DNS message NAME compression algorithm in Python

Answers (1)

Related Questions