Reputation: 87

Generating unique non-similar codes with validation

I know there are similar questions so please bear with me.

I wish to generate approximately 50K codes for people to place orders - ideally no longer than 10 chars and can include letters and digits. They are not discount codes so I am not worried about people trying to guess codes. What I am worried about is somebody accidentally entering a wrong digit (ie 1 instead of l or 0 instead of O) and then the system will fail if by chance it is also a valid code.

As the codes are constantly being generated, ideally I don't want a table look-up validation, but an formula (eg if it contains an A the number element should be divisable by 13 or some such).

Upvotes: 1

Answers (2)

r3mainer

Reputation: 24547

Before you start generating random combinations of characters, there are a couple of things you need to bear in mind:

1. Profanity

If your codes include every possible combination of four letters from the alphabet, they will inevitably include every four-letter word. You need to be absolutely sure that you never ask customers to enter anything foul or offensive.

2. Human error

People often make mistakes when entering codes. Confusing similar characters like O and 0 is only part of the problem. Other common mistakes include transposing adjacent characters (e.g. the → teh) and hitting the wrong key on the keyboard (e.g., and → amd)

To avoid these issues, I would recommend that you generate codes from a restricted alphabet that has no possibility of spelling out anything unfortunate, and use the Luhn algorithm or something similar to catch accidental data entry errors.

For example, here's some Python code that generates hexadecimal codes using an alphabet of 16 characters with no vowels. It uses a linear congruential generator step to avoid outputting sequential numbers, and includes a base-16 Luhn checksum to detect input errors. The code2int() function will return −1 if the checksum is incorrect. Otherwise it will return an integer. If this integer is less than your maximum input value (e.g., 50,000), then you can assume the code is correct.

def int2code(n):
    # Generates a 7-character code from an integer value (n > 0)
    alph = 'BCDFGHJKMNPRTWXZ'
    mod = 0xfffffd          # Highest 24-bit prime
    mul = 0xc36572          # Randomly selected multiplier
    add = 0x5d48ca          # Randomly selected addend
    # Convert the input number `n` into a non-sequential 6-digit
    # hexadecimal code by means of a linear congruential generator
    c = "%06x" % ((n * mul + add) % mod)
    # Replace each hex digit with the corresponding character from alph.
    # and generate a base-16 Luhn checksum at the same time
    luhn_sum = 0
    code = ''
    for i in range(6):
        d = int(c[i], 16)
        code += alph[d]
        if i % 2 == 1:
            t = d * 15
            luhn_sum += (t & 0x0f) + (t >> 4)
        else:
            luhn_sum += d
    # Append the checksum
    checksum = (16 - (luhn_sum % 16)) % 16
    code += alph[checksum]
    return code

def code2int(code):
    # Converts a 7-character code back into an integer value
    # Returns -1 if the input is invalid
    alph = 'BCDFGHJKMNPRTWXZ'
    mod = 0xfffffd          # Highest 24-bit prime
    inv = 0x111548          # Modular multiplicative inverse of 0xc36572
    sub = 0xa2b733          # = 0xfffffd - 0x5d48ca
    if len(code) != 7:
        return -1
    # Treating each character as a hex digit, convert the code back into
    # an integer value. Also make sure the Luhn checksum is correct
    luhn_sum = 0
    c = 0
    for i in range(7):
        if code[i] not in alph:
            return -1
        d = alph.index(code[i])
        c = c * 16 + d
        if i % 2 == 1:
            t = d * 15
            luhn_sum += (t & 0x0f) + (t >> 4)
        else:
            luhn_sum += d
    if luhn_sum % 16 != 0:
        return -1
    # Discard the last digit (corresponding to the Luhn checksum), and undo
    # the LCG calculation to retrieve the original input value
    c = (((c >> 4) + sub) * inv) % mod
    return c

# Test

>>> print('\n'.join([int2code(i) for i in range(10)]))
HWGMTPX
DBPXFZF
XGCFRCN
PKKNDJB
JPWXNRK
DXGGCBR
ZCPNMDD
RHBXZKN
KMKGJTZ
FRWNXCH
>>> print(all([code2int(int2code(i)) == i for i in range(50000)]))
True

Upvotes: 2

user1196549

Reputation:

Select some alphabet (made of digits and letters) of size B such that there are no easy confusions. Assign every symbol a value from 0 to B-1, preferably in random order. Now you can use sequential integers, convert them to base B and assign the symbols accordingly.

For improved safety, you can append one or two checksum symbols for error detection.

With N=34 (ten digits and twenty four letters 9ABHC0FVW3YGJKL1N2456XRTS78DMPQEUZ), 50K codes require codes of length only four.

If you don't want the generated codes to be consecutive, you can scramble the bits before the change of base.

Upvotes: 3

Generating unique non-similar codes with validation

Answers (2)

1. Profanity

2. Human error

Related Questions