mikeLundquist

Reputation: 1009

String of text to unique integer method?

Is there a method that converts a string of text such as 'you' to a number other than

y = tuple('you')
for k in y:
  k = ord(k)

which only converts one character at a time?

Upvotes: 9

Views: 20636

Answers (5)

pitfall

Reputation: 2621

There are several ways to do this, but I prefer hashing because it has the following useful properties:

  1. the resulting integers are approximately uniformly distributed over the hash's output range
  2. even a small change in the input string leads to a very different output integer
  3. it is a one-way process, i.e., you cannot directly compute the input string from the output integer

    import hashlib
    # hashlib offers several hashing functions; they produce digests of different lengths and security levels.
    hashing_func = hashlib.md5
    
    # the lambda func does three things
    # 1. hash a given string using the given algorithm
    # 2. retrieve its hex digest
    # 3. convert the hex digest to an integer
    str2int = lambda s : int(hashing_func(s.encode()).hexdigest(), 16) 
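
As a rough illustration of properties 1 and 2, the sketch below restates `str2int` as a regular function and compares the integers for two strings that differ in one character. The `algorithm` parameter and the sample strings are assumptions added for illustration. Keep in mind that a hash maps arbitrarily many strings onto a fixed number of integers, so uniqueness is very likely for ordinary inputs but not guaranteed.

```python
import hashlib


def str2int(s: str, algorithm: str = "md5") -> int:
    """Hash s with the named hashlib algorithm and read the hex digest as an integer."""
    digest = hashlib.new(algorithm, s.encode()).hexdigest()
    return int(digest, 16)


a = str2int("you")
b = str2int("yov")  # one character changed

# The same input always produces the same integer.
print(a == str2int("you"))

# Count how many of the 128 MD5 output bits differ between the two results.
differing_bits = bin(a ^ b).count("1")
print(differing_bits)

# The digest length bounds the integer: 128 bits for MD5, 256 bits for SHA-256.
print(a.bit_length() <= 128, str2int("you", "sha256").bit_length() <= 256)
```

If a smaller range is needed, reducing the result with `% max_num` (as done in the check further down) keeps the distribution close to uniform.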

To see that the resulting integers are uniformly distributed, we first need a random string generator:


    import string
    import numpy as np 
    # candidate characters
    letters = string.ascii_letters
    # total number of candidates
    L = len(letters)
    # control the seed or prng for reproducible results
    prng = np.random.RandomState(1234)
    
    # define the string prng of length 10
    prng_string = lambda : "".join([letters[k] for k in prng.randint(0, L, size=(10))])

Now we generate a sufficient number of random strings and obtain the corresponding integers:


    ss = [prng_string() for x in range(50000)]
    vv = np.array([str2int(s) for s in ss])

Let us check the uniformity by comparing the theoretical mean and standard deviation of a uniform distribution with the values we observe.


    for max_num in [256, 512, 1024, 4096] :
        ints = vv % max_num
        print("distribution comparsions for max_num = {:4d} \n\t[theoretical] {:7.2f} +/- {:8.3f} | [observed] {:7.2f} +/- {:8.3f}".format(
            max_num, max_num/2., np.sqrt(max_num**2/12), np.mean(ints), np.std(ints)))

The results below indicate that the integers are very close to uniformly distributed.

distribution comparsions for max_num =  256 
    [theoretical]  128.00 +/-   73.901 | [observed]  127.21 +/-   73.755
distribution comparsions for max_num =  512 
    [theoretical]  256.00 +/-  147.802 | [observed]  254.90 +/-  147.557
distribution comparsions for max_num = 1024 
    [theoretical]  512.00 +/-  295.603 | [observed]  512.02 +/-  296.519
distribution comparsions for max_num = 4096 
    [theoretical] 2048.00 +/- 1182.413 | [observed] 2048.67 +/- 1181.422
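
Mean and standard deviation only summarize the distribution. A complementary check, sketched here with the standard library only, counts how many hashed values land in each bucket and computes a chi-square statistic against equal expected counts. The keys `key-0`, `key-1`, ... and the bucket count of 16 are arbitrary assumptions for illustration.

```python
import hashlib
from collections import Counter


def str2int(s: str) -> int:
    """Same idea as above: MD5 hex digest read as an integer."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)


n_keys = 16000
n_buckets = 16

# Count how many hashed keys fall into each bucket.
counts = Counter(str2int(f"key-{i}") % n_buckets for i in range(n_keys))

# Under a uniform distribution every bucket expects the same count.
expected = n_keys / n_buckets
chi2 = sum((counts[b] - expected) ** 2 / expected for b in range(n_buckets))

print(sorted(counts.items()))
print("chi-square statistic (15 degrees of freedom):", round(chi2, 2))
```

For 15 degrees of freedom, a statistic far above roughly 25 to 30 would suggest non-uniform buckets; values well below that are consistent with uniformity.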

It is worth pointing out that the other posted answers may not have these properties.

For example, @poke's convertToNumber solution will give

distribution comparsions for max_num =  256 
    [theoretical]  128.00 +/-   73.901 | [observed]   93.48 +/-   17.663
distribution comparsions for max_num =  512 
    [theoretical]  256.00 +/-  147.802 | [observed]  220.71 +/-  129.261
distribution comparsions for max_num = 1024 
    [theoretical]  512.00 +/-  295.603 | [observed]  477.67 +/-  277.651
distribution comparsions for max_num = 4096 
    [theoretical] 2048.00 +/- 1182.413 | [observed] 1816.51 +/- 1059.643

Upvotes: 2

Anderson Arroyo

Reputation: 367

I was trying to find a way to convert a numpy character array into a unique numeric array for further processing. I implemented the following functions, which include the approaches from the answers by @poke and @falsetru (those methods gave me some trouble when the strings were too long). I also added a hash method (a hash is a fixed-size integer that identifies a particular value).

import numpy as np
def str_to_num(x):
    """Converts a string into a unique concatenated UNICODE representation

    Args:
        x (string): input string

    Raises:
        ValueError: x must be a string

    """
    if isinstance(x, str):
        x = [str(ord(c)) for c in x]
        x = int(''.join(x))
    else:
        raise ValueError('x must be a string.')

    return x


def chr_to_num(x):
    return int.from_bytes(x.encode(), 'little')


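def char_arr_to_num(arr, type = 'hash'):
    """Converts a character array into a numeric array.

    Args:
        arr (np.array): numpy character array.
        type (str): conversion method: 'unicode', 'byte', or 'hash'.

    Raises:
        ValueError: type must be one of the supported methods
    """
    if type == 'unicode':
        vec_fun = np.vectorize(str_to_num)
    elif type == 'byte':
        vec_fun = np.vectorize(chr_to_num)
    elif type == 'hash':
        vec_fun = np.vectorize(hash)
    else:
        # Without this branch an unknown method would fail later with a NameError
        raise ValueError("type must be 'unicode', 'byte', or 'hash'.")
    out = np.apply_along_axis(vec_fun, 0, arr)
    out = out.astype(float)
    return out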

a = np.array([['x', 'y', 'w'], ['x', 'z','p'], ['y', 'z', 'w'], ['x', 'w','y'], ['w', 'z', 'q']])
char_arr_to_num(a, type = 'unicode')
char_arr_to_num(a, type = 'byte')
char_arr_to_num(a, type = 'hash')
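
Two caveats about the 'hash' method may be worth illustrating. Python's built-in `hash()` for strings is salted per interpreter process by default, so the numbers can differ between runs, and casting large integers to float can merge distinct values. The sketch below shows one possible workaround; `stable_hash` is a hypothetical helper, not part of the code above, and truncating SHA-256 to 63 bits is an arbitrary choice made so the result fits a signed 64-bit integer.

```python
import hashlib


def stable_hash(s: str) -> int:
    """Hypothetical helper: first 8 bytes of SHA-256, masked to 63 bits.

    Unlike the built-in hash(), the result does not depend on PYTHONHASHSEED.
    """
    digest = hashlib.sha256(s.encode()).digest()
    return int.from_bytes(digest[:8], "little") & (2**63 - 1)


rows = [["x", "y", "w"], ["x", "z", "p"]]
codes = [[stable_hash(c) for c in row] for row in rows]

# Equal strings map to equal integers, in this run and in any other run.
print(codes[0][0] == codes[1][0])  # True

# float64 has a 53-bit mantissa, so distinct large integers can collapse
# to the same float when an integer array is cast with astype(float).
print(float(2**53) == float(2**53 + 1))  # True
```

The same function could presumably be applied to a numpy array through `np.vectorize`, keeping an integer dtype instead of converting to float.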

Upvotes: 1

chepner

Reputation: 531918

Treat the string as a base-255 number.

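from functools import reduce

y = 'you'

# Reverse the digits to make reconstructing the string more efficient
# (iterating over bytes in Python 3 already yields integers)
digits = reversed(y.encode())
n = reduce(lambda acc, b: acc*255 + b, digits, 0)

# Peel the digits off again, least significant (first byte) first
new_y = bytearray()
while n > 0:
    n, b = divmod(n, 255)
    new_y.append(b)
assert y == new_y.decode()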

(Note this is essentially the same as poke's answer, but written explicitly rather than using available methods for converting between a byte string and an integer.)
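
To make that relationship concrete, here is a small sketch that assumes base 256 (one digit per possible byte value) rather than base 255. With the first byte as the least significant digit, the explicit computation gives the same integer as `int.from_bytes` with `'little'` byte order. The function name `to_int_base256` is made up for this example.

```python
from functools import reduce


def to_int_base256(s: str) -> int:
    """Treat the UTF-8 bytes of s as base-256 digits, first byte least significant."""
    return reduce(lambda acc, b: acc * 256 + b, reversed(s.encode()), 0)


text = "foo bar baz"
explicit = to_int_base256(text)
builtin = int.from_bytes(text.encode(), "little")

print(explicit == builtin)  # True
```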

Upvotes: 3

poke

Reputation: 388113

In order to convert a string to a number (and back), you should always work with bytes first. Since you are using Python 3, strings are Unicode strings and may contain characters whose ord() value is higher than 255. A bytes object, however, is a sequence of values in the range 0–255, so you should always convert between those two types first.

So basically, you are looking for a way to convert a bytes string (which is basically a list of bytes, a list of numbers 0–255) into a single number, and the inverse. You can use int.to_bytes and int.from_bytes for that:

import math
def convertToNumber (s):
    return int.from_bytes(s.encode(), 'little')

def convertFromNumber (n):
    return n.to_bytes(math.ceil(n.bit_length() / 8), 'little').decode()

>>> convertToNumber('foo bar baz')
147948829660780569073512294
>>> x = _
>>> convertFromNumber(x)
'foo bar baz'
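
One edge case matters if the integer has to be strictly unique: with 'little' byte order, trailing NUL characters become high-order zero bytes, so 'a' and 'a\x00' map to the same number. The sketch below shows one possible workaround, appending a sentinel byte before converting. The names `to_number` and `from_number` and the choice of `0x01` as the sentinel are assumptions for illustration, not part of the functions above.

```python
def to_number(s: str) -> int:
    """Encode s as an integer; a trailing 0x01 sentinel preserves trailing NULs."""
    return int.from_bytes(s.encode() + b"\x01", "little")


def from_number(n: int) -> str:
    """Invert to_number: rebuild the bytes, drop the sentinel, decode."""
    raw = n.to_bytes((n.bit_length() + 7) // 8, "little")
    return raw[:-1].decode()


# Without a sentinel, these two strings collide:
print(int.from_bytes("a".encode(), "little") == int.from_bytes("a\x00".encode(), "little"))  # True

# With the sentinel, they stay distinct and round-trip correctly:
print(to_number("a") != to_number("a\x00"))           # True
print(from_number(to_number("a\x00")) == "a\x00")     # True
print(from_number(to_number("foo bar baz")))          # foo bar baz
```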

Upvotes: 32

falsetru

Reputation: 369324

  1. You don't need to convert the string into a tuple.
  2. k is overwritten on each iteration, so the result is lost. Collect the items using something like a list comprehension:

>>> text = 'you'
>>> [ord(ch) for ch in text]
[121, 111, 117]

To get the text back, use chr, and join the characters using str.join:

>>> numbers = [ord(ch) for ch in text]
>>> ''.join(chr(n) for n in numbers)
'you'
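
The same round trip should also work for non-ASCII text, with the difference that code points are not limited to 0–255. A short sketch with an arbitrary sample string:

```python
text = "naïve €5"

# ord() returns Unicode code points, which may be larger than 255.
code_points = [ord(ch) for ch in text]
print(code_points)
print(max(code_points))  # 8364, the code point of '€'

# Encoding to UTF-8 yields byte values in 0-255, but more of them.
utf8_bytes = list(text.encode("utf-8"))
print(len(code_points), len(utf8_bytes))  # 8 11

# chr() and str.join restore the original text.
print("".join(chr(n) for n in code_points) == text)  # True
```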

Upvotes: 2
