jayven huangjunwen
jayven huangjunwen

Reputation: 780

Efficient way to maintain a set of id <-> string relation in python?

Requirement: I have a big set of unique strings. I need to assign an unique int id to each of them. After that, either get id from string or get string from id should be efficient enough (both memory and speed).

If in c/c++, one can store these strings in a hash table (an array of const char * for example), and assign the index of a string in the table as the id.

Is it possible do the same thing or other solution in python? Otherwise I need to maintain two dict that map strings to ids and vice vasa.

Update: the set is frozend, no need to change.

Upvotes: 2

Views: 526

Answers (2)

jayven huangjunwen
jayven huangjunwen

Reputation: 780

Here is my solution now, using only one dict.

class StringTable(object):

    def __init__(self):
        self.table = {}

    def probe_id(self, ss):
        table = self.table
        id_ = hash(ss)

        while True:
            s = table.get(id_, None)

            # id_ not use, return it
            if s is None:
                return id_, False

            # id_ used and equal to ss
            if ss == s:
                return id_, True

            # id_ used by other string, probe again!
            id_ = id_ + 1

            # don't use this, endless loop, why?
            #id_ = hash(id_)

    def store(self, ss):
        id_, occupied = self.probe_id(ss)
        if not occupied:
            self.table[id_] = ss
            print 'store {%s: %r}' % (id_, ss)
        return id_

    def get_string_from_id(self, id_):
        ret = self.table.get(id_, None)
        print 'get_string_from_id %s -> %r' % (id_, ret)
        return ret

    def get_id_from_string(self, ss):
        id_, occupied = self.probe_id(ss)
        ret = id_ if occupied else None
        print 'get_id_from_string %r -> %s' % (ss, ret)
        return ret


st = StringTable()
# http://bugs.python.org/issue14621
s1 = "\x00\xcf\x0b\x96\x19"
s2 = "\x00\x6d\x29\x45\x18"

print repr(s1), hash(s1)
print repr(s2), hash(s2)

id1 = st.store(s1)
id2 = st.store(s2)

st.get_string_from_id(id1)
st.get_string_from_id(id2)

st.get_id_from_string(s1)
st.get_id_from_string(s2)

Output:

jayven@laptop:~/test$ python stringtable.py
'\x00\xcf\x0b\x96\x19' 1220138288
'\x00m)E\x18' 1220138288
store {1220138288: '\x00\xcf\x0b\x96\x19'}
store {1220138289: '\x00m)E\x18'}
get_string_from_id 1220138288 -> '\x00\xcf\x0b\x96\x19'
get_string_from_id 1220138289 -> '\x00m)E\x18'
get_id_from_string '\x00\xcf\x0b\x96\x19' -> 1220138288
get_id_from_string '\x00m)E\x18' -> 1220138289

Upvotes: 1

Sait
Sait

Reputation: 19805

If only string -> id is enough, just use hash function:

In [2]: hash( 'hello' )
Out[2]: 840651671246116861

In [3]: hash( 'helloo' )
Out[3]: -827725961091893887

If you need both ways, as @njzk2 suggested:

values = {hash(value): value for value in string_list}
# from id -> string:
values[id]
# from string -> id:
hash(string)

If you are cautious for collisions in hashing and your data is static, you can check if there are any collisions:

hashes = set()
for value in string_list:
   hashed = hash(value)
   if hashed in hashes:
      print('at least one collision in hashing')
      break
   hashes.add(hashed)
print('no collisions at hashing')

If you have any collisions, which is very unlikely, you can do:

myDict1 = {} # string --> id dictonary
myDict2 = {} # id --> string dictionary

counter = 0
for value in string_list:
   myDict1[value] = counter
   myDict2[counter] = value
   counter += 1

Upvotes: 1

Related Questions