Reputation: 339

generate unique string of length n without prefilled dictionary

I have an application that is kind of like a URL shortener and need to generate unique URL whenever a user requests.

For this I need a function to map an index/number to a unique string of length n with two requirements:

Two different numbers can not generate same string. In other words as long as i,j<K: f(i) != f(j). K is the number of possible strings = 26^n. (26 is number of characters in English)
Two strings generated by number i and i+1 don't look similar most of the times. For example they are not abcdef1 and abcdef2. (So that users can not predict the pattern and the next IDs)

This is my current code in Python:

chars = "abcdefghijklmnopqrstuvwxyz"

for item in itertools.product(chars, repeat=n):
    print("".join(item))

# For n = 7 generates:
# aaaaaaa
# aaaaaab
# aaaaaac
# ...

The problem with this code is there is no index that I can use to generate unique strings on demand by tracking that index. For example generate 1 million unique strings today and 2 million tomorrow without looping through or collision with the first 1 million.

The other problem with this code is that the strings that are created after each other look very similar and I need them to look random.

One option is to populate a table/dictionary with millions of strings, shuffle them and keep track of index to that table but it takes a lot of memory.

An option is also to check the database of existing IDs after generating a random string to make sure it doesn't exist but the problem is as I get closer to the K (26^n) the chance of collision increases and it wouldn't be efficient to make a lot of check_if_exist queries against the database.

Also if n was long enough I could use UUID with small chance of collision but in my case n is 7.

Upvotes: 1

Answers (4)

btilly

Reputation: 46477

I'm going to outline a solution for you that is going to resist casual inspection even by a knowledgeable person, though it probably IS NOT cryptographically secure.

First, your strings and numbers are in a one-to-one map. Here is some simple code for that.

alphabet = 'abcdefghijklmnopqrstuvwxyz'
len_of_codes = 7
char_to_pos = {}
for i in range(len(alphabet)):
    char_to_pos[alphabet[i]] = i

def number_to_string(n):
    chars = []
    for _ in range(len_of_codes):
        chars.append(alphabet[n % len(alphabet)])
        n = n // len(alphabet)
    return "".join(reversed(chars))

def string_to_number(s):
    n = 0
    for c in s:
        n = len(alphabet) * n + char_to_pos[c]
    return n

So now your problem is how to take an ascending stream of numbers and get an apparently random stream of numbers out of it instead. (Because you know how to turn those into strings.) Well, there are lots of tricks for primes, so let's find a decent sized prime that fits in the range that you want.

def is_prime (n):
    for i in range(2, n):
        if 0 == n%i:
            return False
        elif n < i*i:
            return True
    if n == 2:
        return True
    else:
        return False

def last_prime_before (n):
    for m in range(n-1, 1, -1):
        if is_prime(m):
            return m

print(last_prime_before(len(alphabet)**len_of_codes)

With this we find that we can use the prime 8031810103. That's how many numbers we'll be able to handle.

Now there is an easy way to scramble them. Which is to use the fact that multiplication modulo a prime scrambles the numbers in the range 1..(p-1).

def scramble1 (p, k, n):
    return (n*k) % p

Picking a random number to scramble by, int(random.random() * 26**7) happened to give me 3661807866, we get a sequence we can calculate with:

for i in range(1, 5):
    print(number_to_string(scramble1(8031810103, 3661807866, i)))

Which gives us

lwfdjoc
xskgtce
jopkctb
vkunmhd

This looks random to casual inspection. But will be reversible for any knowledgeable someone who puts modest effort in. They just have to guess the prime and algorithm that we used, look at 2 consecutive values to get the hidden parameter, then look at a couple of more to verify it.

Before addressing that, let's figure out how to take a string and get the number back. Thanks to Fermat's little theorem we know for p prime and 1 <= k < p that (k * k^(p-2)) % p == 1.

def n_pow_m_mod_k (n, m, k):
    answer = 1
    while 0 < m:
        if 1 == m % 2:
            answer = (answer * n) % k
        m = m // 2
        n = (n * n) % k
    return answer

print(n_pow_m_mod_k(3661807866, 8031810103-2, 8031810103))

This gives us 3319920713. Armed with that we can calculate scramble1(8031810103, 3319920713, string_to_number("vkunmhd")) to find out that vkunmhd came from 4.

Now let's make it harder. Let's generate several keys to be scrambling with:

import random
p = 26**7
for i in range(5):
    p = last_prime_before(p)
    print((p, int(random.random() * p)))

When I ran this I happened to get:

(8031810103, 3661807866)
(8031810097, 3163265427)
(8031810091, 7069619503)
(8031809963, 6528177934)
(8031809917, 991731572)

Now let's scramble through several layers, working from smallest prime to largest (this requires reversing the sequence):

def encode (n):
    for p, k in [
            (8031809917, 991731572)
          , (8031809963, 6528177934)
          , (8031810091, 7069619503)
          , (8031810097, 3163265427)
          , (8031810103, 3661807866)
        ]:
        n = scramble1(p, k, n)
    return number_to_string(n)

This will give a sequence:

ehidzxf
shsifyl
gicmmcm
ofaroeg

And to reverse it just use the same trick that reversed the first scramble (reversing the primes so I am unscrambling in the order that I started with):

def decode (s):
    n = string_to_number(s)
    for p, k in [
            (8031810103, 3319920713)
          , (8031810097, 4707272543)
          , (8031810091, 5077139687)
          , (8031809963, 192273749)
          , (8031809917, 5986071506)
        ]:
        n = scramble1(p, k, n)

    return n

TO BE CLEAR I do NOT promise that this is cryptographically secure. I'm not a cryptographer, and I'm aware enough of my limitations that I know not to trust it.

But I do promise that you'll have a sequence of over 8 billion strings that you are able to encode/decode with no obvious patterns.

Now take this code, scramble the alphabet, regenerate the magic numbers that I used, and choose a different number of layers to go through. I promise you that I personally have absolutely no idea how someone would even approach the problem of figuring out the algorithm. (But then again I'm not a cryptographer. Maybe they have some techniques to try. I sure don't.)

Upvotes: 3

Ribodou

Reputation: 81

How about :

from random import Random

n = 7


def f(i):
    myrandom = Random()
    myrandom.seed(i)
    alphabet = "123456789"
    return "".join([myrandom.choice(alphabet) for _ in range(7)])


# same entry, same output
assert f(0) == "7715987"
assert f(0) == "7715987"
assert f(0) == "7715987"

# different entry, different output
assert f(1) == "3252888"

(change the alphabet to match your need)

This "emulate" a UUID, since you said you could accept a small chance of collision. If you want to avoid collision, what you really need is a perfect hash function (https://en.wikipedia.org/wiki/Perfect_hash_function).

Upvotes: 2

Ayush Garg

Reputation: 2517

This is a really simple example of what I outlined in this comment. It just offsets the number based on i. If you want "different" strings, don't use this, because if num is 0, then you will get abcdefg (with n = 7).

alphabet = "abcdefghijklmnopqrstuvwxyz"

# num is the num to convert, i is the "offset"
def num_to_char(num, i):
    return alphabet[(num + i) % 26]

# generate the link
def generate_link(num, n):
    return "".join([num_to_char(num, i) for i in range(n)])

generate_link(0, 7) # "abcdefg"
generate_link(0, 7) # still "abcdefg"
generate_link(0, 7) # again, "abcdefg"!

generate_link(1, 7) # now it's "bcdefgh"!

You would just need to change the num + i to some complicated and obscure math equation.

Upvotes: 0

SR3142

Reputation: 610

you can try something based on the sha1 hash

#!/usr/bin/python3

import hashlib

def generate_link(i):
    n = 7
    a = "abcdefghijklmnopqrstuvwxyz01234567890"
    return "".join(a[x%36] for x in hashlib.sha1(str(i).encode('ascii')).digest()[-n:])

Upvotes: 0

generate unique string of length n without prefilled dictionary

Answers (4)

Related Questions