Reputation: 780
Requirement: I have a big set of unique strings. I need to assign an unique int id to each of them. After that, either get id from string or get string from id should be efficient enough (both memory and speed).
If in c/c++, one can store these strings in a hash table (an array of const char *
for example), and assign the index of a string in the table as the id.
Is it possible do the same thing or other solution in python? Otherwise I need to maintain two dict that map strings to ids and vice vasa.
Update: the set is frozend, no need to change.
Upvotes: 2
Views: 526
Reputation: 780
Here is my solution now, using only one dict.
class StringTable(object):
def __init__(self):
self.table = {}
def probe_id(self, ss):
table = self.table
id_ = hash(ss)
while True:
s = table.get(id_, None)
# id_ not use, return it
if s is None:
return id_, False
# id_ used and equal to ss
if ss == s:
return id_, True
# id_ used by other string, probe again!
id_ = id_ + 1
# don't use this, endless loop, why?
#id_ = hash(id_)
def store(self, ss):
id_, occupied = self.probe_id(ss)
if not occupied:
self.table[id_] = ss
print 'store {%s: %r}' % (id_, ss)
return id_
def get_string_from_id(self, id_):
ret = self.table.get(id_, None)
print 'get_string_from_id %s -> %r' % (id_, ret)
return ret
def get_id_from_string(self, ss):
id_, occupied = self.probe_id(ss)
ret = id_ if occupied else None
print 'get_id_from_string %r -> %s' % (ss, ret)
return ret
st = StringTable()
# http://bugs.python.org/issue14621
s1 = "\x00\xcf\x0b\x96\x19"
s2 = "\x00\x6d\x29\x45\x18"
print repr(s1), hash(s1)
print repr(s2), hash(s2)
id1 = st.store(s1)
id2 = st.store(s2)
st.get_string_from_id(id1)
st.get_string_from_id(id2)
st.get_id_from_string(s1)
st.get_id_from_string(s2)
Output:
jayven@laptop:~/test$ python stringtable.py
'\x00\xcf\x0b\x96\x19' 1220138288
'\x00m)E\x18' 1220138288
store {1220138288: '\x00\xcf\x0b\x96\x19'}
store {1220138289: '\x00m)E\x18'}
get_string_from_id 1220138288 -> '\x00\xcf\x0b\x96\x19'
get_string_from_id 1220138289 -> '\x00m)E\x18'
get_id_from_string '\x00\xcf\x0b\x96\x19' -> 1220138288
get_id_from_string '\x00m)E\x18' -> 1220138289
Upvotes: 1
Reputation: 19805
If only string -> id
is enough, just use hash
function:
In [2]: hash( 'hello' )
Out[2]: 840651671246116861
In [3]: hash( 'helloo' )
Out[3]: -827725961091893887
If you need both ways, as @njzk2 suggested:
values = {hash(value): value for value in string_list}
# from id -> string:
values[id]
# from string -> id:
hash(string)
If you are cautious for collisions in hashing and your data is static, you can check if there are any collisions:
hashes = set()
for value in string_list:
hashed = hash(value)
if hashed in hashes:
print('at least one collision in hashing')
break
hashes.add(hashed)
print('no collisions at hashing')
If you have any collisions, which is very unlikely, you can do:
myDict1 = {} # string --> id dictonary
myDict2 = {} # id --> string dictionary
counter = 0
for value in string_list:
myDict1[value] = counter
myDict2[counter] = value
counter += 1
Upvotes: 1