Incorrect use of byte strings in Python CRC16 implementation?

Question

I'm trying to implement my own cyclic redundancy check (CRC) in Python. The layout of my program is as follows:

1. random_message(n) generates a random byte message of length n.
2. Generate the checksum value using the CRC code crc16.
3. Run the corruption code corrupt_data on the generated message.
4. Check whether the checksum is different or not (I did this using ==).
5. Repeat steps 1 to 4 many times to see how often an error (i.e., the corruption) goes unnoticed.

I am confident that the methods crc16 and corrupt_data are correct, so I don't think there's much reason to analyse them too closely. I think the problems start with my use of byte strings in the second half of my program, after those two methods.

My code is as follows:

from random import random
from random import choice
from string import ascii_uppercase

CORRUPTION_RATE = 0.25

def crc16(data: bytes):
    xor_in = 0x0000  # initial value
    xor_out = 0x0000  # final XOR value
    poly = 0x8005  # generator polynomial (normal form)

    reg = xor_in
    for octet in data:
        # reflect in
        for i in range(8):
            topbit = reg & 0x8000
            if octet & (0x80 >> i):
                topbit ^= 0x8000
            reg <<= 1
            if topbit:
                reg ^= poly
        reg &= 0xFFFF
        # reflect out
    return reg ^ xor_out

from random import randbytes


def corrupt_data(data : bytes):
    '''
    some random corruption of byte data
    can be modified as needed using the CORRUPTION_RATE global constant
    ''' 
    temp = data[:]
    while True:
        location = int(len(temp) * random())
        data_list = list(temp)
        if random() < 0.5:
            data_list[location] = (data_list[location] + 1) % 256
        else: 
            data_list[location] = (data_list[location] - 1) % 256
        temp = bytes(data_list)
        if random() < CORRUPTION_RATE and temp != data:
            break
    return temp

# Generate random byte message of length n
def random_message(n):
    
    randomBytes = ''.join(choice(ascii_uppercase) for i in range(n)).encode()
    print("randomBytes is " + str(randomBytes))
    print("The class type of randomBytes is " + str(type(randomBytes)))
    return randomBytes

    
    
numberOfErrors = 0;

for i in range(10000):

    # generating random byte message of length n
    randomMessage = random_message(5)

    # generating the checksum value using the CRC code
    checksumValue = crc16(randomMessage)
    #print("checksumValue is " + str(checksumValue))
    #print("The class type of checksumValue is " + str(type(checksumValue)))

    # running the corruption on the generated message
    #print("The class type of bchecksumValue is " + str(type(b"checksumValue")))
    corrupt = corrupt_data(b"checksumValue")
    #print("The class type of corrupt_data(bchecksumValue) is " + str(type(corrupt)))

    #print("Checking whether the checksum is different ... ")
    different = (b"checksumValue" == corrupt)
    #print("bchecksumValue == corrupt is " + str(different))
    #print("bchecksumValue was " + str(b"checksumValue") + ", and corrupt was " + str(corrupt))
    
    if(different == False):
        numberOfErrors += 1
        
print("numberOfErrors is " + str(numberOfErrors))

As you can see, I've inserted various (now commented out) print statements to help me with debugging.

The problem is that, when I run the above code, I get that numberOfErrors is 10000. Obviously, this can't be correct, since we expect some of them to be correct, and so we expect numberOfErrors to be somewhat less than 10000.

As I said, I am confident that the crc16 and corrupt_data functions are correct, and I suspect that the problem is arising somewhere in my use of the byte strings inside the for loop:

numberOfErrors = 0;

for i in range(10000):

    # generating random byte message of length n
    randomMessage = random_message(5)

    # generating the checksum value using the CRC code
    checksumValue = crc16(randomMessage)
    #print("checksumValue is " + str(checksumValue))
    #print("The class type of checksumValue is " + str(type(checksumValue)))

    # running the corruption on the generated message
    #print("The class type of bchecksumValue is " + str(type(b"checksumValue")))
    corrupt = corrupt_data(b"checksumValue")
    #print("The class type of corrupt_data(bchecksumValue) is " + str(type(corrupt)))

    #print("Checking whether the checksum is different ... ")
    different = (b"checksumValue" == corrupt)
    #print("bchecksumValue == corrupt is " + str(different))
    #print("bchecksumValue was " + str(b"checksumValue") + ", and corrupt was " + str(corrupt))
    
    if(different == False):
        numberOfErrors += 1
        
print("numberOfErrors is " + str(numberOfErrors))

I've never really programmed with bytes / byte strings, and I've also only just recently started learning Python, so I don't understand what I'm doing incorrectly. Where's my error, and how do I fix it?

EDIT

As mentioned by user2357112 supports Monica in the comments, the problem might be b"checksumValue" in corrupt = corrupt_data(b"checksumValue"). The problem I had was that the function crc16 returns an int, so, in order to convert it back into bytes for passing into the function corrupt_data(data : bytes), I tried using the b prefix. I guess this is my Python inexperience showing.
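
To illustrate the point from the comments: the b prefix builds a bytes literal from the characters between the quotes; it does not look up a variable of that name. The following is a minimal sketch, assuming the 16-bit checksum should be serialised as two big-endian bytes (the byte order and the example value are assumptions made for illustration only):

```python
checksumValue = 10073  # arbitrary example value for a 16-bit checksum

# The b prefix makes a literal: these are the 13 ASCII characters of the
# name "checksumValue", regardless of what the variable holds.
literal = b"checksumValue"
print(literal, len(literal))

# int.to_bytes converts the integer itself; 2 bytes cover a 16-bit CRC.
# Big-endian byte order is an assumption made for this sketch.
as_bytes = checksumValue.to_bytes(2, "big")
print(as_bytes, int.from_bytes(as_bytes, "big"))

# str(...).encode("ascii") produces the decimal digits as text instead,
# which is a different byte string from the two raw bytes above.
as_text = str(checksumValue).encode("ascii")
print(as_text)
```

Either conversion yields a bytes object that corrupt_data will accept, but neither changes what is being corrupted, which is the checksum rather than the message.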

EDIT2

Ok, so I'm trying the solution offered in this answer. The modified code is as follows:

# running the corruption on the generated message
bs = str(checksumValue).encode('ascii')
print("str(checksumValue).encode('ascii') is " + str(bs))
#print("The class type of bchecksumValue is " + str(type(b"checksumValue")))
print("The class type of str(checksumValue).encode('ascii') is " + str(type(bs)))
#corrupt = corrupt_data(b"checksumValue")
corrupt = corrupt_data(bs)
#print("The class type of corrupt_data(bchecksumValue) is " + str(type(corrupt)))
print("The class type of corrupt_data(bs) is " + str(type(corrupt)))

The output is

randomBytes is b'BBVFC'
The class type of randomBytes is <class 'bytes'>
checksumValue is 10073
The class type of checksumValue is <class 'int'>
str(checksumValue).encode('ascii') is b'10073'
The class type of str(checksumValue).encode('ascii') is <class 'bytes'>
The class type of corrupt_data(bs) is <class 'bytes'>

So the classes seem to match with what we'd expect.

EDIT3

Implementing the changes in EDIT2 in the for loop, I still get numberOfErrors is 10000 as my output. The code is as follows:

numberOfErrors = 0;

for i in range(10000):

    # generating random byte message of length n
    randomMessage = random_message(5)

    # generating the checksum value using the CRC code
    checksumValue = crc16(randomMessage)
    #print("checksumValue is " + str(checksumValue))
    #print("The class type of checksumValue is " + str(type(checksumValue)))

    # running the corruption on the generated message
    bs = str(checksumValue).encode('ascii')
    #print("str(checksumValue).encode('ascii') is " + str(bs))
    #print("The class type of str(checksumValue).encode('ascii') is " + str(type(bs)))
    corrupt = corrupt_data(bs)
    #print("The class type of corrupt_data(bs) is " + str(type(corrupt)))
    
    #print("Checking whether the checksum is different ... ")
    different = (bs == corrupt)
    #print("bs == corrupt is " + str(different))
    #print("bs was " + str(bs) + ", and corrupt was " + str(corrupt))
    
    if(different == False):
        numberOfErrors += 1
        
print("numberOfErrors is " + str(numberOfErrors))

Blckknght · Accepted Answer

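Your issue isn't really with the byte strings; it's a logic error. You're corrupting the wrong thing. You don't want to corrupt the checksum: you want to corrupt the original message and then take a checksum of the corrupted version. Then you can compare the two checksums to see whether they match. In your current code, corrupt_data only returns once its result differs from its input, so comparing the input with the corrupted output is always False, and every one of the 10000 iterations gets counted.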

Try:

undetected_errors = 0

for i in range(10000):
    good_message = random_message(5)
    good_checksum = crc16(good_message)

    corrupted_message = corrupt_data(good_message)
    corrupted_checksum = crc16(corrupted_message)

    if good_checksum == corrupted_checksum:
        undetected_errors += 1
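
For reference, the corrected loop can be combined with the question's two functions into one self-contained script. This is only a sketch under stated assumptions: count_undetected, its parameters, the explicit rng argument, and the fixed seed are hypothetical additions for illustration, and crc16 and corrupt_data are condensed copies of the functions from the question rather than new algorithms.

```python
import random

CORRUPTION_RATE = 0.25


def crc16(data: bytes) -> int:
    """Bitwise CRC-16 with polynomial 0x8005 and initial value 0, as in the question."""
    poly = 0x8005
    reg = 0x0000
    for octet in data:
        for i in range(8):  # feed the bits of each byte, most significant first
            topbit = reg & 0x8000
            if octet & (0x80 >> i):
                topbit ^= 0x8000
            reg = (reg << 1) & 0xFFFF
            if topbit:
                reg ^= poly
    return reg


def corrupt_data(data: bytes, rng: random.Random) -> bytes:
    """Nudge random bytes by +1 or -1 until the result differs from the input."""
    temp = bytearray(data)
    while True:
        location = rng.randrange(len(temp))
        step = 1 if rng.random() < 0.5 else -1
        temp[location] = (temp[location] + step) % 256
        if rng.random() < CORRUPTION_RATE and bytes(temp) != data:
            return bytes(temp)


def count_undetected(trials: int, n: int, seed: int = 0) -> int:
    """Count corruptions of random n-byte messages that leave the CRC unchanged."""
    rng = random.Random(seed)  # hypothetical fixed seed, only for repeatable runs
    alphabet = b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    undetected = 0
    for _ in range(trials):
        good_message = bytes(rng.choice(alphabet) for _ in range(n))
        corrupted_message = corrupt_data(good_message, rng)
        # Compare checksums of the two messages, never the messages themselves.
        if crc16(good_message) == crc16(corrupted_message):
            undetected += 1
    return undetected


if __name__ == "__main__":
    trials = 10000
    print("undetected errors:", count_undetected(trials, 5), "out of", trials)
```

Passing a random.Random instance around instead of using the module-level functions is a design choice of this sketch; it keeps runs reproducible when the same seed is used.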
