Octet (byte) stuffing and unstuffing, i.e. replacing one byte with two or v.v

Question

I am receiving binary data from a device, via a socket (bluetooth), using socket.recv().

I've considered buffering in a list or bytearray until sufficient data has been received to be decoded and acted upon. That is, until I've received a start- and a stop flag (bytes).

However, the device is applying "octet stuffing". That is, every occurrence of the following two bytes should be replaced with a single byte, like so:

dic = { '\xFE\xDC' : '\xFC' , '\xFE\xDD' : '\xFD' , '\xFE\xDE' : '\xFE'}

In addition, when data is being sent, the reverse should be applied. For instance, one byte 0xFC -> yields two bytes 0xFE, 0xDC.

What really happens is that when stuffing (sending data) if one of 0xFC, 0xFD, 0xFE is detected, 0xFE is added ahead of the byte, which is itself XORed 0x20. On unstuffing (receiving data) the 0XFE is discarded and the following byte is XORed 0x20.

To say I'm fairly new to Python would be an understatement. I've started coding yesterday, have an interface up and running. This however, is kind of tricky.

I know I can put the data into a string and do a replace. But to pack binary data into a string, replace and then unpack and decode seems a bit inefficient.

I can also watch incoming data and act when I see the 0xFE flag. What would be nice is a way to stuff/unstuff a list, bytearray or whatever.

Replacing single bytes in a list or bytearray doesn't seem too hard, but replacing one with two or the other way around ...?

Any help very much appreciated.

(This is Python 2.7 by the way.)

Peter Schneider · Accepted Answer

Overview

You need to wrap a bytestream and escape specific values. Also, the other way around is required: unescape control-codes and get the raw payload. You are working with sockets. The socket-commands uses string-parameters. In python, every string is basically a wrapper around a char*-array.

Naive approach

Its a string and we want to replace specific values with other ones. So what is the simplest way to achieve this?

def unstuff(self, s):
    return s.replace('\xFE\xDC', '\xFC').replace('\xFE\xDD', '\xFE').replace('\xFE\xDE', '\xFE')
def stuff(self, s):
    return s.replace('\xFC', '\xFE\xDC').replace('\xFD', '\xFE\xDD').replace('\xFE', '\xFE\xDE')

Seems to be bad. With every replace-call, a new string-copy will be created.

Iterator

A very pythonic approach is to define an iterator for this specific problem: define the iterator to transform the input-data into the desired output.

def unstuff(data):
    i = iter(data)
    dic = {'\xDC' : '\xFC', '\xDD' : '\xFD', '\xFE' : '\xDE'}
    while True:
        d = i.next()  # throws StopIteration on the end
        if d == '\xFE':
            d2 = i.next()
            if d2 in dic:
                yield dic[d2]
            else:
                yield '\xFE'
                yield d2
        else:
            yield d

def stuff(data):
    i = iter(data)
    dic = { '\xFC' : '\xDC', '\xFD' : '\xDD', '\xFE' : '\xDE' }
    while True: 
        d = i.next() # throws StopIteration on the end
        if d in dic:
            yield '\xFE'
            yield dic[d]
        else:
            yield d

def main():
    s = 'hello\xFE\xDCWorld'
    unstuffed = "".join(unstuff(s))
    stuffed = "".join(stuff(unstuffed))
    print s, unstuffed, stuffed

    # also possible
    for c in unstuff(s):
        print ord(c)

if __name__ == '__main__':
    main()

stuff() and unstuff() need something iterable (list, string, ...) and return an iterator-object. If you want to print the result or pass it into socket.send, you need to convert it back to a string (as shown with "".join()). Every unexpected data is handled somehow: 0xFE 0x__ will be returned verbatim, if it does not match any pattern.

RegExp

Another way would be to use regular expressions. Its a big topic and a source of trouble sometimes, but we can keep it simple:

import re

s = 'hello\xFE\xDCWorld' # our test-string

# read: FE DC or FE DD or FE DE
unstuff   = re.compile('\xFE\xDC|\xFE\xDD|\xFE\xDE')
# read: 
#  - use this pattern to match against the string
#  - replace what you have found (m.groups(0), whole match) with
#    char(ord(match[1])^0x20)
unstuffed = unstuff.sub(lambda m: chr(ord(m.group(0)[1])^0x20), s)

# same thing, other way around
stuff   = re.compile('\xFC|\xFD|\xFE')
stuffed = stuff.sub(lambda m: '\xFE' + chr(ord(m.group(0))^0x20), unstuffed)

print s, unstuffed, stuffed

As said, you must create the new string somewhere to be able to use it with sockets. At least, this approach do not create unnecessary copies of the string like s.replace(..).replace(..).replace(..) would. You should keep the patterns stuffand unstuff somewhere around as building these objects is relatively expensive.

native C-Function

If some things are going to slow in python, we might want to use cpython and implement it as oure C-code. Basically, I do a first run, count how many bytes I nerd, allocate a new string and do the second run. I'm not very used to python-c-extensions and so I do not want to share this code. It just seems to work, see the next chapter

Comparison

One of the most important rules of optimization: compare! The basic setup for every test:

generate random binary data as a string
while less_than_a_second:
    unstuff(stuff(random_data))
    count += 1
return time_needed / count

I know, the setup isn't optimal. But we should get some usable result:

What do we see? Native is the fastest way to go, but only for very small strings. This is probably because of the python-interpreter: only one function-call is needed instead of three. But microseconds is fast enough the most of the times. After ~500 bytes, the timings are nearly the same with the naive approach. There must be some deep magic happening down there in the implementation. Iterators and RegExp are unacceptable compared to the effort.

To sum things up: use the naive approach. Its hard to get something better. Also: if you simply guess about timings, you will be almost always wrong.